Distributional Alignment as a Criterion for Designing Task Vectors in In-Context Learning

arXiv cs.CL 05/21/26, 04:00 AM Papers
in-context-learning task-vectors distributional-alignment linear-task-vector llm efficiency
Summary
This paper proposes using distributional alignment between task vector-based and in-context learning inference as a criterion for designing task vectors, and introduces Linear Task Vector (LTV) that minimizes next-token probability discrepancy via closed-form linear mapping. LTV achieves 9.2% average accuracy improvement over baselines across eight benchmarks and five LLMs.
arXiv:2605.20730v1 Announce Type: new Abstract: In-context learning (ICL) allows large language models (LLMs) to adapt to new tasks through demonstrations, yet it suffers from escalating inference costs as context length increases. While task vectors offer a promising alternative by compressing demonstrations into compact hidden-state representations, their quality has been evaluated only through downstream task accuracy. This indirect criterion provides limited insight into how to design more effective task vector extraction methods. In this paper, we posit that inference using task vectors should align their predictive distribution with that of ICL. To quantify this, we introduce $d_{\text{NTP}}$, a metric that measures the discrepancy in next-token probabilities between task vector-based and ICL-based inference. Our empirical analysis reveals that $d_{\text{NTP}}$ serves as a performance proxy, exhibiting a strong negative correlation with downstream accuracy. Motivated by this, we develop Linear Task Vector (LTV), a method designed to minimize $d_{\text{NTP}}$ via a closed-form linear mapping that estimates demonstration effects through regression. Across eight classification benchmarks and five LLMs, LTV consistently outperforms existing task vector baselines, improving average accuracy by 9.2\% while reducing inference latency. We further show that LTV outperforms the baselines on regression tasks. Moreover, we investigate the transferability of LTV across different model scales; an aspect that has remained nascent in task vector research. Specifically, we empirically show that task vectors from a larger model can enhance a smaller model's performance by 6.4\%, suggesting a new utility for extracted task representations.
Cached at: 05/21/26, 06:35 AM
# Distributional Alignment as a Criterion for Designing Task Vectors in In-Context Learning
Source: [https://arxiv.org/html/2605.20730](https://arxiv.org/html/2605.20730)
Jihoon Kwon (Seoul National University, kog0712@snu.ac.kr), Jiwon Choi∗ (Yonsei University, jiii111@yonsei.ac.kr), Jy-yong Sohn† (Yonsei University, jysohn1108@yonsei.ac.kr)

###### Abstract

In-context learning (ICL) allows large language models (LLMs) to adapt to new tasks through demonstrations, yet it suffers from escalating inference costs as context length increases. While task vectors offer a promising alternative by compressing demonstrations into compact hidden-state representations, their quality has been evaluated only through downstream task accuracy. This indirect criterion provides limited insight into how to design more effective task vector extraction methods. In this paper, we posit that inference using task vectors should produce a predictive distribution aligned with that of ICL. To quantify this, we introduce $d_{\text{NTP}}$, a metric that measures the discrepancy in next-token probabilities between task vector-based and ICL-based inference. Our empirical analysis reveals that $d_{\text{NTP}}$ serves as a performance proxy, exhibiting a strong negative correlation with downstream accuracy. Motivated by this, we develop Linear Task Vector (LTV), a method designed to minimize $d_{\text{NTP}}$ via a closed-form linear mapping that estimates demonstration effects through regression. Across eight classification benchmarks and five LLMs, LTV consistently outperforms existing task vector baselines, improving average accuracy by 9.2% while reducing inference latency. We further show that LTV outperforms the baselines on regression tasks. Moreover, we investigate the transferability of LTV across different model scales, an aspect that remains underexplored in task vector research. Specifically, we empirically show that task vectors from a larger model can enhance a smaller model's performance by 6.4%, suggesting a new utility for extracted task representations.

## 1 Introduction

In-Context Learning (ICL) has emerged as a powerful paradigm for adapting large language models (LLMs) to new tasks by simply prepending labeled demonstrations before the query \[[6](https://arxiv.org/html/2605.20730#bib.bib46),[47](https://arxiv.org/html/2605.20730#bib.bib68)\]. ICL has been shown to achieve impressive performance gains across various tasks without requiring any model parameter updates, with performance typically improving as more demonstrations are provided \[[2](https://arxiv.org/html/2605.20730#bib.bib65),[9](https://arxiv.org/html/2605.20730#bib.bib73)\]. However, this improvement comes at a cost: longer demonstrations either increase inference-time computation as the input length grows, or incur memory overhead when caching the activations \[[32](https://arxiv.org/html/2605.20730#bib.bib81),[28](https://arxiv.org/html/2605.20730#bib.bib31),[12](https://arxiv.org/html/2605.20730#bib.bib62),[14](https://arxiv.org/html/2605.20730#bib.bib82)\]. These computational and memory limitations hinder the practical utility of ICL under resource constraints.

To address these limitations, recent work has proposed *task vectors in ICL* as a training-free approach that enables task adaptation without directly using demonstrations at inference time \[[16](https://arxiv.org/html/2605.20730#bib.bib29),[41](https://arxiv.org/html/2605.20730#bib.bib32),[29](https://arxiv.org/html/2605.20730#bib.bib33),[24](https://arxiv.org/html/2605.20730#bib.bib41)\]. In this approach, a task vector (TV) refers to a condensed vector extracted from the internal activations of an LLM when performing ICL, encapsulating the task information that the LLM implicitly infers from demonstrations \[[15](https://arxiv.org/html/2605.20730#bib.bib30),[50](https://arxiv.org/html/2605.20730#bib.bib34),[10](https://arxiv.org/html/2605.20730#bib.bib67)\]. By applying this vector to zero-shot inference, where no demonstrations are provided, the model achieves substantial performance gains on the task \[[16](https://arxiv.org/html/2605.20730#bib.bib29)\]. While various TV extraction methods have been proposed in recent years, downstream task performance remains the only established criterion for comparing them, offering limited insight into *why* one method outperforms another and *how* to improve existing extraction methods.

In this paper, we posit that distributional alignment with ICL is a desirable property of task vectors, and a useful criterion for designing TV extraction methods\. This perspective is motivated by the following idea: since the role of a task vector is to condense the effect of demonstrations, TV\-based inference should naturally produce a predictive distribution that closely aligns with that of ICL\.

From this perspective, we make the following contributions:

- We propose $d_{\text{NTP}}$, a metric that quantifies the quality of TV methods by measuring the *discrepancy* between the predictive distribution under TV-based inference and that under ICL-based inference, in terms of *next-token probability (NTP)*. We empirically show that $d_{\text{NTP}}$ has a strong negative correlation with downstream performance, serving as an indicator of TV quality.
- We develop the Linear Task Vector (LTV) method, which is designed to reduce $d_{\text{NTP}}$. Specifically, LTV employs a linear mapping that estimates the effect of demonstrations, and uses the closed-form solution of a regression problem to extract task vectors.
- In experiments, LTV consistently outperforms existing TV methods on eight classification benchmarks and five LLMs, improving average accuracy by 9.2% while reducing inference latency. Furthermore, LTV outperforms the baselines on regression tasks. Finally, we extend our study to the transferability of task vectors, a dimension largely unexplored in existing research. By applying task vectors extracted from a larger model to smaller ones, which may have limited capacity or context lengths, we achieve a 6.4% improvement in classification accuracy.

## 2 Related Work

##### Task Vectors\.

The pioneering work of Ilharco et al. \[[19](https://arxiv.org/html/2605.20730#bib.bib6)\] introduces the concept of task vectors, defined as the difference in parameter space between a pre-trained model and the model fine-tuned for a specific task. The core idea is that shifting the weights of a model in the task vector direction improves performance on that task \[[34](https://arxiv.org/html/2605.20730#bib.bib13),[52](https://arxiv.org/html/2605.20730#bib.bib27),[25](https://arxiv.org/html/2605.20730#bib.bib19)\]. Recent works have shown that task vectors can also be extracted from representation spaces, such as activation spaces \[[16](https://arxiv.org/html/2605.20730#bib.bib29),[50](https://arxiv.org/html/2605.20730#bib.bib34)\] or soft prompt spaces \[[5](https://arxiv.org/html/2605.20730#bib.bib72)\].

##### In\-Context Learning\.

ICL enables LLMs to adapt to new tasks by simply prepending query-label pairs as demonstrations to the model input \[[6](https://arxiv.org/html/2605.20730#bib.bib46)\]. The success of ICL has motivated various theoretical interpretations \[[55](https://arxiv.org/html/2605.20730#bib.bib74),[44](https://arxiv.org/html/2605.20730#bib.bib44),[3](https://arxiv.org/html/2605.20730#bib.bib66),[33](https://arxiv.org/html/2605.20730#bib.bib16),[51](https://arxiv.org/html/2605.20730#bib.bib15),[4](https://arxiv.org/html/2605.20730#bib.bib25),[27](https://arxiv.org/html/2605.20730#bib.bib24)\]. A notable line of work interprets ICL as implicit Bayesian inference \[[48](https://arxiv.org/html/2605.20730#bib.bib38),[37](https://arxiv.org/html/2605.20730#bib.bib23),[54](https://arxiv.org/html/2605.20730#bib.bib9)\]: as the model processes demonstrations, it implicitly infers a *latent task concept* from them and conditions its predictions on the resulting posterior. This view provides the theoretical foundation for task vectors in ICL \[[31](https://arxiv.org/html/2605.20730#bib.bib77)\], a line of work we describe in detail below.

##### Task Vectors in ICL\.

A key limitation of ICL is that it incurs substantial inference\-time computation and memory overhead as the input length increases\[[28](https://arxiv.org/html/2605.20730#bib.bib31),[2](https://arxiv.org/html/2605.20730#bib.bib65)\]\. To reduce these overheads, recent works aim to internalize the task adaptation induced by ICL, by adjusting the model parameters or activations so that the effects of demonstrations are encoded within the model itself\. One line of work achieves this through few\-shot parameter\-efficient fine\-tuning \(PEFT\)\[[28](https://arxiv.org/html/2605.20730#bib.bib31),[20](https://arxiv.org/html/2605.20730#bib.bib80),[12](https://arxiv.org/html/2605.20730#bib.bib62),[26](https://arxiv.org/html/2605.20730#bib.bib69)\]\. Another line of work explores task vectors in ICL, offering a training\-free alternative that enables task adaptation\.

Numerous methods have been proposed for extracting task vectors in ICL, demonstrating performance gains over zero\-shot inference\[[16](https://arxiv.org/html/2605.20730#bib.bib29),[41](https://arxiv.org/html/2605.20730#bib.bib32),[29](https://arxiv.org/html/2605.20730#bib.bib33),[24](https://arxiv.org/html/2605.20730#bib.bib41),[28](https://arxiv.org/html/2605.20730#bib.bib31),[21](https://arxiv.org/html/2605.20730#bib.bib4),[46](https://arxiv.org/html/2605.20730#bib.bib3)\]\. As the model activations span multiple modules, existing methods vary widely in where and how task vectors are extracted\. This diversity underscores the need for a direct criterion to evaluate the quality of extracted task vectors\.

## 3 Background

In this section, we review the concepts and notation used in our paper. Sec. [3.1](https://arxiv.org/html/2605.20730#S3.SS1) describes how LLMs predict the next token, Sec. [3.2](https://arxiv.org/html/2605.20730#S3.SS2) defines our target classification task, and Sec. [3.3](https://arxiv.org/html/2605.20730#S3.SS3) introduces three inference modes for LLMs: zero-shot, ICL, and task vectors.

### 3.1 Model: Large Language Models (LLMs)

We consider pre-trained auto-regressive LLMs, which predict the next token $u$ given an input prompt $p$, a sequence of tokens. The model consists of three components:

- the embedding layer, which converts each token in the prompt $p$ into an embedding vector,
- the transformer \[[43](https://arxiv.org/html/2605.20730#bib.bib49)\] (TF) decoder consisting of $L$ layers, which transforms the sequence of embedding vectors into a sequence of hidden states,
- the language modeling (LM) head, which predicts the probability of the next token $u$ based on the output of TF.

Let $[\bm{h}_1(p), \bm{h}_2(p), \dots, \bm{h}_l(p)]$ denote the hidden states at the final layer of the TF decoder, where $l$ is the sequence length and $\bm{h}_l(p) \in \mathbb{R}^{d}$ is a $d$-dimensional vector. The LM head predicts the next token $u \in \mathcal{U}$ by applying a linear projection to the last hidden state $\bm{h}_l(p)$, where $\mathcal{U} = \{1, 2, \ldots, N_{\mathcal{U}}\}$ is the vocabulary set; each token is represented by its index. Specifically, the probability of the next token is computed as

$$P(u \mid p) = \sigma(\bm{W}_{\mathrm{lm}} \bm{h}_l(p))[u], \quad u \in \mathcal{U}, \tag{1}$$

where $\bm{W}_{\mathrm{lm}} \in \mathbb{R}^{N_{\mathcal{U}} \times d}$ denotes the weight matrix of the LM head, and $\sigma(\cdot)$ denotes the softmax function. For notational simplicity, we hereafter write $\bm{h}(p)$ for $\bm{h}_l(p)$, as we only use the hidden state of the last token. We also write $\text{TF}(p)$ to denote the hidden state $\bm{h}(p)$ obtained by embedding the prompt $p$ and passing it through TF.

### 3.2 Task: Classification

While large language models \(LLMs\) can be applied to a wide range of downstream tasks, prior work on task vectors in ICL\[[16](https://arxiv.org/html/2605.20730#bib.bib29),[28](https://arxiv.org/html/2605.20730#bib.bib31),[39](https://arxiv.org/html/2605.20730#bib.bib59)\]has primarily focused on classification settings\. Following this line of work, we also restrict our attention to classification tasks\.

We define a classification task by a distribution $\mathcal{D}$ over query–label pairs $(x, y)$, where the query $x$ is a text sequence and the label $y$ belongs to a task-specific label set $\mathcal{C} \subseteq \mathcal{U}$ containing $\lvert \mathcal{C} \rvert = K$ classes. Given a query $x$, the goal is to predict its corresponding label $y$. We consider the next-token distribution *restricted* to the label set $\mathcal{C}$:

$$P(c \mid p; \mathcal{C}) = \frac{P(c \mid p)}{\sum_{c' \in \mathcal{C}} P(c' \mid p)}, \quad c \in \mathcal{C}. \tag{2}$$

For notational simplicity, we hereafter write $P(c \mid p)$ to denote this label-restricted distribution. In greedy decoding, the predicted label $\hat{y}$ is determined by selecting the class with the highest probability:

$$\hat{y} = \mathrm{argmax}_{c \in \mathcal{C}} \, P(c \mid p). \tag{3}$$
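
As a rough illustration of equations (1) to (3), the following NumPy sketch computes the label-restricted next-token distribution and the greedy prediction from a last-token hidden state. The function names, dimensions, label token ids, and random weights are hypothetical stand-ins, not values from the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def label_restricted_probs(h, W_lm, label_ids):
    """Eq. (1) followed by Eq. (2): full-vocabulary softmax, renormalized over the label set."""
    p_full = softmax(W_lm @ h)          # P(u | p) for every token u in the vocabulary
    p_labels = p_full[label_ids]        # keep only the tokens in the label set C
    return p_labels / p_labels.sum()    # renormalize so the probabilities over C sum to 1

def greedy_label(h, W_lm, label_ids):
    """Eq. (3): pick the label token with the highest restricted probability."""
    probs = label_restricted_probs(h, W_lm, label_ids)
    return int(label_ids[int(np.argmax(probs))])

# Toy setup (all values hypothetical): d = 8, a vocabulary of 50 tokens, K = 3 labels.
rng = np.random.default_rng(0)
W_lm = rng.normal(size=(50, 8))
h = rng.normal(size=8)
label_ids = np.array([3, 17, 42])

probs = label_restricted_probs(h, W_lm, label_ids)
pred = greedy_label(h, W_lm, label_ids)
```

Because the softmax normalizer cancels in equation (2), the restricted distribution equals a softmax taken over the label logits alone, so an implementation need not evaluate the full-vocabulary softmax.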
![Refer to caption](https://arxiv.org/html/2605.20730v1/x1.png)

Figure 1: Comparison of three inference modes. In the *zero-shot* inference mode (left), the model predicts the next token $\hat{y}_{\text{zs}}$ solely based on the test query $x_{\text{test}}$. In the *in-context learning* mode (middle), the model predicts the next token $\hat{y}_{\text{icl}}$ based on the concatenation of demonstrations $Z$ and the query $x_{\text{test}}$. In the *task vector* mode (right), the model predicts the next token $\hat{y}_{\text{tv}}$ based on not only the query $x_{\text{test}}$ but also an injected task vector $\bm{v} = f(Z)$, which is added to the model activation. Here, the task vector $\bm{v}$ is constructed by a function $f$ using the demonstrations $Z$.

### 3.3 Methods: Inference Modes for LLMs

We introduce three inference modes for LLMs: zero-shot inference in Sec. [3.3.1](https://arxiv.org/html/2605.20730#S3.SS3.SSS1), in-context learning (ICL) in Sec. [3.3.2](https://arxiv.org/html/2605.20730#S3.SS3.SSS2), and task vectors in Sec. [3.3.3](https://arxiv.org/html/2605.20730#S3.SS3.SSS3).

#### 3.3.1 Zero-shot Inference Mode

In the zero-shot inference mode, the LLM predicts $y_{\text{test}}$ for the test query $x_{\text{test}}$ without being provided any labeled examples $(x, y)$ for the target task. The leftmost part of Fig. [1](https://arxiv.org/html/2605.20730#S3.F1) shows the detailed process. First, the test query $x_{\text{test}}$ is passed through the TF to obtain the hidden state $\bm{h}_{\text{zs}} = \text{TF}(x_{\text{test}})$. Then, the LM head computes the probability $P(c \mid x_{\text{test}})$ for each class $c$ from $\bm{h}_{\text{zs}}$. Finally, the predicted label is selected via greedy decoding:

$$\hat{y}_{\text{zs}} = \mathrm{argmax}_{c \in \mathcal{C}} \, P(c \mid x_{\text{test}}). \tag{4}$$

Throughout the paper, we use the subscript 'zs' to indicate that a quantity pertains to zero-shot inference.

#### 3.3.2 In-Context Learning (ICL) Mode

Suppose we are given $k$ demonstrations $Z = \{(x_i, y_i)\}_{i=1}^{k}$ sampled from the task distribution $\mathcal{D}$. As shown in the middle of Fig. [1](https://arxiv.org/html/2605.20730#S3.F1), ICL prepends $Z$ to the test query $x_{\text{test}}$ and passes them through the model, which outputs $\bm{h}_{\text{icl}} = \text{TF}([Z \,\|\, x_{\text{test}}])$, where $\|$ represents the concatenation operator. The LM head then computes the probability $P(c \mid [Z \,\|\, x_{\text{test}}])$ for each class $c$ from $\bm{h}_{\text{icl}}$. We denote this probability by $P_{\text{icl}}(c \mid x_{\text{test}}, Z)$ to indicate the ICL inference mode. The predicted label is then obtained as

$$\hat{y}_{\text{icl}} = \mathrm{argmax}_{c \in \mathcal{C}} \, P_{\text{icl}}(c \mid x_{\text{test}}, Z). \tag{5}$$

#### 3.3.3 Task Vector (TV) Mode

Inference using task vectors is a variant of ICL. This mode is motivated by the implicit Bayesian view \[[48](https://arxiv.org/html/2605.20730#bib.bib38)\], which posits that during ICL, the LLM predicts the label $y$ conditioned on a latent task concept $\bm{v}$ inferred from demonstrations $Z$. Under this view, the predictive distribution computed by the LLM under ICL can be expressed as

$$P(y \mid x, Z) = \int_{\bm{v}} P(y \mid x, \bm{v}) \, P(\bm{v} \mid Z) \, d\bm{v}, \tag{6}$$

where $P(\bm{v} \mid Z)$ denotes the posterior over the latent task concept $\bm{v}$ inferred from $Z$, and $P(y \mid x, \bm{v})$ denotes the predictive distribution of the label $y$ conditioned on the inferred concept $\bm{v}$.

The TV mode explicitly makes use of the decomposition of the predictive distribution on the right-hand side of equation [6](https://arxiv.org/html/2605.20730#S3.E6). In other words, the task vector mode is composed of two stages: (1) the *extraction* stage, which extracts a task vector $\bm{v}$ from the model activation induced by the demonstrations $Z$, and (2) the *inference* stage, which applies the extracted vector $\bm{v}$ to the model activation during zero-shot inference. Below we formally describe each stage, as shown in the rightmost column of Fig. [1](https://arxiv.org/html/2605.20730#S3.F1).

##### Extraction of task vector\.

Let $f$ denote a task vector extraction function that takes demonstrations $Z$ and query $x$ as input, and outputs a task vector $\bm{v}$. Formally, we represent the task vector $\bm{v}$ as

$$\bm{v} = f(x, Z). \tag{7}$$

For notational simplicity, we suppress the dependence of $\bm{v}$ on $x$ and $Z$ when it is clear from context.

##### Inference using task vector\.

Recall that in the zero-shot inference mode (specified in Sec. [3.3.1](https://arxiv.org/html/2605.20730#S3.SS3.SSS1)), TF outputs the hidden state of the last token $\bm{h}_{\text{zs}}$ when the input is set to the test query $x_{\text{test}}$. In the TV mode, the task vector $\bm{v}$ extracted in equation [7](https://arxiv.org/html/2605.20730#S3.E7) is used to update¹ the hidden state from $\bm{h}_{\text{zs}}$ to the task-conditioned hidden state $\bm{h}_{\text{tv}} = \bm{h}_{\text{zs}} + \bm{v}$. For a given $\bm{h}_{\text{tv}}$, the LM head computes the probability $P(c \mid x_{\text{test}}, \bm{v})$ for each class $c$; we denote this probability by $P_{\text{tv}}(c \mid x_{\text{test}}, \bm{v})$ to indicate that this inference mode uses task vectors. The predicted label is then determined as

$$\hat{y}_{\text{tv}} = \mathrm{argmax}_{c \in \mathcal{C}} \, P_{\text{tv}}(c \mid x_{\text{test}}, \bm{v}). \tag{8}$$

¹ While our method additively updates the output of TF, there exist other task vector-based methods that combine a model activation and $\bm{v}$ in different ways. We focus on this additive case for notational simplicity.
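
The additive update can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the hidden states are random stand-ins for $\text{TF}(x_{\text{test}})$ and $\text{TF}([Z \,\|\, x_{\text{test}}])$, and `tv_mode_probs` is a hypothetical helper name rather than anything defined in the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def tv_mode_probs(h_zs, v, W_lm, label_ids):
    """TV-mode inference: add the task vector to the zero-shot hidden state,
    then apply the LM head restricted to the label tokens."""
    h_tv = h_zs + v                               # task-conditioned hidden state
    return softmax((W_lm @ h_tv)[label_ids])      # P_tv(c | x_test, v) over the label set

# Hypothetical toy values: d = 8, vocabulary of 50 tokens, K = 3 labels.
rng = np.random.default_rng(1)
W_lm = rng.normal(size=(50, 8))
label_ids = np.array([3, 17, 42])
h_zs = rng.normal(size=8)     # stand-in for TF(x_test)
h_icl = rng.normal(size=8)    # stand-in for TF([Z || x_test])

p_zs = tv_mode_probs(h_zs, np.zeros(8), W_lm, label_ids)        # v = 0 recovers zero-shot
p_icl = softmax((W_lm @ h_icl)[label_ids])                      # ICL-mode distribution
p_tv_ideal = tv_mode_probs(h_zs, h_icl - h_zs, W_lm, label_ids) # v = h_icl - h_zs
y_hat_tv = int(label_ids[int(np.argmax(p_tv_ideal))])           # Eq. (8)
```

Two limiting cases are visible here: a zero task vector reduces TV mode to zero-shot inference, while the vector $\bm{h}_{\text{icl}} - \bm{h}_{\text{zs}}$ would reproduce the ICL distribution exactly but requires running ICL on the test query, which is the cost a task vector is meant to avoid.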
![Refer to caption](https://arxiv.org/html/2605.20730v1/x2.png)

Figure 2: Overview of the proposed metric $d_{\text{NTP}}(f; Z)$ in equation [9](https://arxiv.org/html/2605.20730#S4.E9), which measures the quality of the task vector extraction method $f$. In the ICL mode, the model receives demonstrations $Z$ and test query $x_{\text{test}}$ together to estimate the probability distribution $P_{\text{icl}}$ for the next token (left). In the TV mode, the task vector $\bm{v}$ is injected in the hidden layer (instead of putting demonstrations $Z$ in the input layer) to get the distribution $P_{\text{tv}}$ for the next token (right). Our proposed metric measures the expected Kullback-Leibler (KL) divergence between these two distributions ($P_{\text{icl}}$ and $P_{\text{tv}}$), thus checking whether the effect of using $\bm{v} = f(Z)$ is equivalent to the effect of using $Z$ in the input.

## 4 Measuring the Quality of Task Vector

We propose a metric that measures the quality of the task vector extracted by $f$. We first provide the motivation for our approach in Sec. [4.1](https://arxiv.org/html/2605.20730#S4.SS1), and then propose our metric in Sec. [4.2](https://arxiv.org/html/2605.20730#S4.SS2). Finally, we show that our metric serves as an indicator of task vector quality in Sec. [4.3](https://arxiv.org/html/2605.20730#S4.SS3).

### 4.1 Motivation

Recall that the goal of using task vectors is to condense the task information inferred during ICL into a compact vector $\bm{v}$, enabling predictions similar to ICL without the overhead of processing demonstrations. However, prior work has relied solely on downstream task accuracy to evaluate the quality of the task vector. This evaluation practice offers limited insight into *why* one method outperforms another, and provides little guidance on *how* to design better task vector methods.

To address this, we propose a metric grounded in the goal of using task vectors: *enabling predictions similar to ICL*. If the task vector successfully captures the task information inferred during ICL, its inference should yield a predictive distribution *aligned* with that of ICL. We thus propose measuring the discrepancy between the two distributions, formally defined below.

### 4.2 Proposed Metric

Recall that $P_{\text{icl}}$ and $P_{\text{tv}}$ are the next-token probabilities computed by the model in the ICL mode and the TV mode, respectively. Given demonstrations $Z$, we measure the quality of the task vector extraction method $f$ as the discrepancy between the ICL mode and the TV mode in terms of the next-token probability (NTP), denoted by

$$d_{\text{NTP}}(f; Z) = \mathbb{E}_{x \sim \mathcal{D}} \Big[ D_{\mathrm{KL}}\big( P_{\text{icl}}(\cdot \mid x, Z) \;\|\; P_{\text{tv}}(\cdot \mid x, f(Z)) \big) \Big], \tag{9}$$

where $D_{\mathrm{KL}}$ represents the Kullback-Leibler (KL) divergence \[[22](https://arxiv.org/html/2605.20730#bib.bib61)\]. Here, $P_{\text{icl}}$ serves as the reference distribution, and the metric quantifies how far $P_{\text{tv}}$ deviates from the reference. See Fig. [2](https://arxiv.org/html/2605.20730#S3.F2) for an illustration of our proposed metric.

A lower $d_{\text{NTP}}$ indicates that $P_{\text{tv}}$ is more closely aligned with $P_{\text{icl}}$. Thus, a task vector extraction method $f$ with lower $d_{\text{NTP}}(f)$ is considered more desirable. Notably, our proposed metric evaluates the quality of task vector extraction methods without requiring labels for the test set.
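
A minimal sketch of how equation (9) might be estimated in practice: average the KL divergence between the ICL and TV label distributions over a sample of queries. The function names, the clipping constant `eps`, and the toy probability tables are assumptions made for illustration; the paper's own estimation details are in its appendix and are not reproduced here.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps clipping guards against log(0)."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def d_ntp(P_icl, P_tv):
    """Monte Carlo estimate of Eq. (9).
    Rows index sampled queries x; columns index the K classes."""
    return float(np.mean([kl_divergence(p, q) for p, q in zip(P_icl, P_tv)]))

# Hypothetical label distributions for two queries and K = 3 classes.
P_icl = np.array([[0.70, 0.20, 0.10],
                  [0.10, 0.80, 0.10]])
P_tv_close = np.array([[0.65, 0.25, 0.10],
                       [0.15, 0.75, 0.10]])
P_tv_far = np.array([[0.20, 0.40, 0.40],
                     [0.50, 0.30, 0.20]])

score_close = d_ntp(P_icl, P_tv_close)   # TV distribution near the ICL reference
score_far = d_ntp(P_icl, P_tv_far)       # TV distribution far from the ICL reference
```

Note that no ground-truth labels appear anywhere in the computation; only the two predictive distributions are needed, which is why the metric can be evaluated on unlabeled queries.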

![Refer to caption](https://arxiv.org/html/2605.20730v1/x3.png)

Figure 3: Correlation between the proposed discrepancy metric $d_{\text{NTP}}(f; Z)$ and the test accuracy, measured on various task vector extraction methods $f$ and demonstrations $Z$. We test four variants of $f$: Function Vector \[[41](https://arxiv.org/html/2605.20730#bib.bib32)\], Task Vector \[[16](https://arxiv.org/html/2605.20730#bib.bib29)\], State Vector \[[24](https://arxiv.org/html/2605.20730#bib.bib41)\], and I2CL \[[28](https://arxiv.org/html/2605.20730#bib.bib31)\], each shown in a different color. Here, each point corresponds to the result for a different set of demonstrations $Z$. Pearson correlation coefficients $(\rho)$ are reported for each method $f$. Across four classification benchmarks (columns) and two models (rows), lower $d_{\text{NTP}}(f; Z)$ consistently correlates with higher accuracy, validating our metric as a principled criterion for the quality of the task vector.

### 4.3 Correlation of Proposed Metric With Test Accuracy

A natural question is whether our proposed metric $d_{\text{NTP}}(f)$ in equation [9](https://arxiv.org/html/2605.20730#S4.E9) is a good indicator of the quality of the task vector extraction method $f$ in practice. In this section, we empirically validate that $d_{\text{NTP}}(f)$ strongly correlates with the test accuracy when the task vector $\bm{v} = f(Z)$ is used, across diverse classification benchmarks.

##### Experimental Setup\.

We evaluate on four classification benchmarks: AGNews, DBPedia \[[53](https://arxiv.org/html/2605.20730#bib.bib51)\], MR \[[35](https://arxiv.org/html/2605.20730#bib.bib54)\], and SST-2 \[[40](https://arxiv.org/html/2605.20730#bib.bib57)\]. We compute the metric $d_{\text{NTP}}(f)$ and the test accuracy for four task vector extraction methods $f$: Function Vector \[[41](https://arxiv.org/html/2605.20730#bib.bib32)\], Task Vector \[[16](https://arxiv.org/html/2605.20730#bib.bib29)\], State Vector \[[24](https://arxiv.org/html/2605.20730#bib.bib41)\], and I2CL \[[28](https://arxiv.org/html/2605.20730#bib.bib31)\]. We test on LLaMA-3.1-8B \[[11](https://arxiv.org/html/2605.20730#bib.bib39)\] and Qwen-2.5-7B \[[38](https://arxiv.org/html/2605.20730#bib.bib40)\] models using 30 demonstrations $Z$, across 20 independent runs with randomly sampled demonstrations $Z$. See Appendix [A.3](https://arxiv.org/html/2605.20730#A1.SS3) for further details.

##### Results\.

Fig. [3](https://arxiv.org/html/2605.20730#S4.F3) presents scatter plots of $d_{\text{NTP}}(f)$ versus accuracy $\text{Acc}(f)$, where each point corresponds to the result obtained with a different set of demonstrations $Z$. Across all tasks and models, lower $d_{\text{NTP}}(f)$ consistently correlates with higher accuracy, with most Pearson correlation coefficients exceeding $0.6$ in magnitude. This strong correlation indicates that lower $d_{\text{NTP}}(f)$ reflects superior task vector quality. This result motivates developing a task vector extraction method aimed at reducing $d_{\text{NTP}}(f)$, as proposed in the next section.
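
For readers who want to run the same kind of analysis on their own runs, the Pearson coefficient between per-run $d_{\text{NTP}}$ values and accuracies can be computed as below. The five data points are invented purely to show the mechanics and are not results from the paper.

```python
import numpy as np

# Hypothetical per-run measurements: one (d_NTP, accuracy) pair per sampled demonstration set Z.
d_ntp_values = np.array([0.05, 0.10, 0.20, 0.35, 0.50])
accuracies = np.array([0.88, 0.86, 0.80, 0.71, 0.65])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is Pearson's rho.
rho = float(np.corrcoef(d_ntp_values, accuracies)[0, 1])
```

A negative `rho` means that runs with lower discrepancy tend to have higher accuracy, which is the pattern reported in Fig. 3.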

## 5 Proposed Method: Linear Task Vector

In Sec. [4](https://arxiv.org/html/2605.20730#S4), we empirically observed that in the TV mode, the test accuracy negatively correlates with the discrepancy $d_{\text{NTP}}(f)$ measured for the task vector extraction method $f$. This motivates us to devise a task vector extraction method that achieves a small $d_{\text{NTP}}(f)$, which is expected to improve test accuracy when the task vector is used.

Based on this motivation, we propose Linear Task Vector (LTV), which leverages a linear mapping to compute task vectors that enable TV mode inference to closely resemble that of ICL. We first present the rationale behind our approach in Sec. [5.1](https://arxiv.org/html/2605.20730#S5.SS1), then formally describe LTV in Sec. [5.2](https://arxiv.org/html/2605.20730#S5.SS2).

### 5.1 Rationale Behind the Proposed Method

The proposed LTV method is designed to achieve the following two goals simultaneously. First, we aim to design a method $f$ that achieves a small $d_{\text{NTP}}(f)$, meaning that the next-token probability $P_{\text{tv}}$ under TV mode closely replicates $P_{\text{icl}}$ under ICL mode. Second, we seek to preserve the key advantage of ICL, which does not require updating the parameters of a model. These goals lead us to formulate a proxy optimization problem that satisfies two conditions: (1) the proxy objective is provably related to the original objective $d_{\text{NTP}}$, and (2) the proxy problem has a closed-form solution.
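
To make the idea of a closed-form linear mapping concrete, here is a minimal sketch of the ridge regression outlined in Fig. 4: regress the demonstration effect $\bm{h}_{\text{icl}} - \bm{h}_{\text{zs}}$ on $\bm{h}_{\text{zs}}$ and apply the fitted map to a test query. It assumes the standard ridge solution $\bm{W} = \bm{Y}\bm{H}^{\top}(\bm{H}\bm{H}^{\top} + \lambda \bm{I})^{-1}$; the regularization strength `lam`, the function names, and the synthetic hidden states are hypothetical, and the paper's exact formulation may differ.

```python
import numpy as np

def fit_linear_map(H_zs, H_icl, lam=1e-2):
    """Ridge regression from zero-shot hidden states to the demonstration effect.
    H_zs, H_icl: (d, N) matrices with one column per unlabeled training query.
    Returns W of shape (d, d) minimizing ||W H - Y||_F^2 + lam * ||W||_F^2."""
    Y = H_icl - H_zs                              # regression targets: h_icl - h_zs
    d = H_zs.shape[0]
    A = H_zs @ H_zs.T + lam * np.eye(d)           # symmetric positive definite for lam > 0
    # W = Y H^T A^{-1}; since A is symmetric, solve A W^T = H Y^T instead of inverting A.
    return np.linalg.solve(A, H_zs @ Y.T).T

def extract_task_vector(W, h_zs_test):
    """Query-dependent task vector v = W h_zs(x_test)."""
    return W @ h_zs_test

# Synthetic check (assumed data): the true demonstration effect is exactly linear in h_zs.
rng = np.random.default_rng(0)
d, N = 4, 50
W_true = rng.normal(size=(d, d))
H_zs = rng.normal(size=(d, N))
H_icl = H_zs + W_true @ H_zs

W_hat = fit_linear_map(H_zs, H_icl, lam=1e-8)
h_test = rng.normal(size=d)
h_tv = h_test + extract_task_vector(W_hat, h_test)   # additive TV-mode update
```

Fitting requires only forward passes to collect hidden states plus one linear solve, so no model parameters are updated, which matches the second goal stated above.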

![Refer to caption](https://arxiv.org/html/2605.20730v1/x4.png)

Figure 4: Overview of our Linear Task Vector (LTV) method. Our method employs a linear mapping $\bm{W}$ that estimates the effect of demonstrations in the hidden space, $(\bm{h}_{\text{icl}} - \bm{h}_{\text{zs}})$, from the hidden state $\bm{h}_{\text{zs}}$ of the zero-shot inference mode via ridge regression. In the extraction phase (left), we use $N$ unlabeled training queries $\{x_j\}_{j=1}^{N}$ to define (1) the regression target matrix $\bm{Y}$ as the concatenation of $N$ column vectors $\{\bm{h}_{\text{icl}}(x_j) - \bm{h}_{\text{zs}}(x_j)\}_{j=1}^{N}$ and (2) the variable matrix $\bm{H}$ as the concatenation of $N$ column vectors $\{\bm{h}_{\text{zs}}(x_j)\}_{j=1}^{N}$. Subsequently, a closed-form solution for the optimal linear mapping $\bm{W}^{\star}$ from $\bm{H}$ to $\bm{Y}$ is obtained. In the inference phase (right), we use the mapping $\bm{W}^{\star}$ to inject the task vector $\bm{v} = \bm{W}^{\star} \bm{h}_{\text{zs}}(x_{\text{test}})$ in the hidden state, for a given test query $x_{\text{test}}$.

##### Formulation of the proxy optimization problem.

Given demonstrations $Z$, our goal is to find the optimal $f$ that minimizes the discrepancy $d_{\text{NTP}}(f)$ between $P_{\text{icl}}$ and $P_{\text{tv}}$. As depicted in Fig. [2](https://arxiv.org/html/2605.20730#S3.F2), the predictive distributions $P_{\text{icl}}$ and $P_{\text{tv}}$ are obtained by applying the LM head to the respective hidden states $\bm{h}_{\text{icl}}$ and $\bm{h}_{\text{tv}}$. Thus, the metric $d_{\text{NTP}}(f)$ in equation [9](https://arxiv.org/html/2605.20730#S4.E9) can be expressed as:

$$
d_{\text{NTP}}(f) = \mathbb{E}\Big[ D_{\mathrm{KL}}\big( \sigma(\bm{W}_{\mathrm{lm}} \bm{h}_{\text{icl}}) \;\|\; \sigma(\bm{W}_{\mathrm{lm}} (\bm{h}_{\text{zs}} + f(Z))) \big) \Big],
$$

where $\bm{h}_{\text{tv}} = \bm{h}_{\text{zs}} + \bm{v} = \bm{h}_{\text{zs}} + f(Z)$ is the hidden state of TV mode. Thus, minimizing $d_{\text{NTP}}(f)$ with respect to $f$ leads to the following optimization problem:

$$
\min_{f} \; \mathbb{E}\Big[ D_{\mathrm{KL}}\big( \sigma(\bm{W}_{\mathrm{lm}} \bm{h}_{\text{icl}}) \;\|\; \sigma(\bm{W}_{\mathrm{lm}} (\bm{h}_{\text{zs}} + f(Z))) \big) \Big]. \tag{10}
$$

However, solving equation [10](https://arxiv.org/html/2605.20730#S5.E10) requires iteratively updating the task vector extraction model $f$, as no closed-form solution exists. In order to preserve the key advantage of ICL (which does not require any model updates), we instead introduce a proxy objective in a way that the corresponding minimization problem has a closed-form solution on $f$. To this end, we define the proxy objective as the mean squared error (MSE) between the hidden states $\bm{h}_{\text{icl}}$ and $\bm{h}_{\text{tv}}$:

$$
\mathcal{L}_{\text{MSE}}(f) = \mathbb{E}\Big[ \big\| \bm{h}_{\text{icl}} - \bm{h}_{\text{tv}} \big\|_2^2 \Big] = \mathbb{E}\Big[ \big\| \bm{h}_{\text{icl}} - \bm{h}_{\text{zs}} - f(Z) \big\|_2^2 \Big]. \tag{11}
$$

Recall that $d_{\text{NTP}}$ in equation [9](https://arxiv.org/html/2605.20730#S4.E9) is defined as the expectation over query $x$. With explicit dependence on $x$ and demonstrations $Z$, the proxy optimization problem becomes:

$$
\min_{f} \; \mathbb{E}_{x \sim \mathcal{D}}\Big[ \big\| \bm{h}_{\text{icl}}(x, Z) - \bm{h}_{\text{zs}}(x) - f(x, Z) \big\|_2^2 \Big]. \tag{12}
$$
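To make the two objectives concrete, the following minimal sketch estimates both $d_{\text{NTP}}$ (mean KL divergence between next-token distributions) and the proxy $\mathcal{L}_{\text{MSE}}$ from a batch of hidden states. It is an illustration under stated assumptions, not the authors' implementation: the hidden states and LM head are random synthetic arrays, and the helper names `d_ntp` and `mse_proxy` are hypothetical.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def d_ntp(h_icl, h_tv, W_lm):
    """Empirical mean of KL(P_icl || P_tv).

    h_icl, h_tv: (N, d) hidden states, one row per query.
    W_lm: (V, d) LM head mapping a hidden state to V logits.
    """
    log_p = log_softmax(h_icl @ W_lm.T)
    log_q = log_softmax(h_tv @ W_lm.T)
    kl_per_query = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_query.mean())

def mse_proxy(h_icl, h_tv):
    # Empirical mean of the squared l2 distance between hidden states.
    return float(((h_icl - h_tv) ** 2).sum(axis=-1).mean())

# Synthetic stand-ins (assumption): real values would come from an LLM.
rng = np.random.default_rng(0)
N, d, V = 32, 16, 50
W_lm = rng.normal(size=(V, d)) / np.sqrt(d)
h_icl = rng.normal(size=(N, d))
h_far = h_icl + 1.0 * rng.normal(size=(N, d))    # poor approximation of h_icl
h_near = h_icl + 0.01 * rng.normal(size=(N, d))  # close approximation of h_icl
```

In this toy setting, hidden states that are closer to $\bm{h}_{\text{icl}}$ in squared distance also yield a smaller KL divergence, which is the qualitative behavior the proxy objective relies on.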
A natural question is how this proxy objective $\mathcal{L}_{\text{MSE}}$ is related to the original objective $d_{\text{NTP}}$. The following proposition answers this question:

###### Proposition 5.1 (Relationship between $d_{\text{NTP}}$ and $\mathcal{L}_{\text{MSE}}$).

Assume the language modeling head $\bm{W}_{\text{lm}}$ has a bounded spectral norm $\|\bm{W}_{\text{lm}}\|_2 \leq C_1$ and the log-softmax function is $C_2$-Lipschitz in the $\ell_2$-norm. Then, for any function $f$,

$$
d_{\text{NTP}}(f) \leq C_1 C_2 \sqrt{\mathcal{L}_{\text{MSE}}(f)}. \tag{13}
$$

We defer the proof of Proposition [5.1](https://arxiv.org/html/2605.20730#S5.Thmtheorem1) to Appendix [A.2](https://arxiv.org/html/2605.20730#A1.SS2). This proposition indicates that we can reduce $d_{\text{NTP}}(f)$ by finding $f$ which has small $\mathcal{L}_{\text{MSE}}(f)$.
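For intuition, one plausible route to a bound of this form is sketched below. It is not taken from Appendix A.2, and the authors' proof may proceed differently. The shorthand $p = \sigma(\bm{W}_{\text{lm}} \bm{h}_{\text{icl}})$ and $q = \sigma(\bm{W}_{\text{lm}} \bm{h}_{\text{tv}})$ is introduced here only for the sketch. The first step uses that $p$ is a probability vector, and the last step is Jensen's inequality.

```latex
\begin{align*}
D_{\mathrm{KL}}(p \,\|\, q)
  &= \sum_i p_i \,(\log p_i - \log q_i)
   \;\le\; \|\log p - \log q\|_\infty
   \;\le\; \|\log p - \log q\|_2 \\
  &\le C_2 \,\big\|\bm{W}_{\text{lm}} (\bm{h}_{\text{icl}} - \bm{h}_{\text{tv}})\big\|_2
   \;\le\; C_1 C_2 \,\|\bm{h}_{\text{icl}} - \bm{h}_{\text{tv}}\|_2 , \\[4pt]
d_{\text{NTP}}(f)
  &= \mathbb{E}\big[ D_{\mathrm{KL}}(p \,\|\, q) \big]
   \;\le\; C_1 C_2 \,\mathbb{E}\big[ \|\bm{h}_{\text{icl}} - \bm{h}_{\text{tv}}\|_2 \big]
   \;\le\; C_1 C_2 \sqrt{\mathbb{E}\big[ \|\bm{h}_{\text{icl}} - \bm{h}_{\text{tv}}\|_2^2 \big]}
   \;=\; C_1 C_2 \sqrt{\mathcal{L}_{\text{MSE}}(f)} .
\end{align*}
```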

##### Using a linear mapping to solve the proxy problem\.

We now focus on solving the proxy problem in equation [12](https://arxiv.org/html/2605.20730#S5.E12), i.e., finding $f$ which outputs the task vector $\bm{v} = f(x, Z)$ that resembles $\bm{h}_{\text{icl}} - \bm{h}_{\text{zs}}$. In other words, the purpose of task vector $\bm{v}$ is to compensate for the effect of demonstrations in the ICL mode (compared with the zero-shot inference mode which does not use demonstrations) in the hidden state. Thus, a natural approach to solve this problem is to find a mapping from $\bm{h}_{\text{zs}}$ (the hidden state for the zero-shot inference mode) to the target $\bm{h}_{\text{icl}} - \bm{h}_{\text{zs}}$. In order to get a closed-form solution for the optimization problem, we consider the simplest case of using a *linear* mapping $\bm{W}$ (in Appendix [A.7](https://arxiv.org/html/2605.20730#A1.SS7), we empirically compare this linear choice with alternatives, and discuss non-linear extensions), which results in the following formulation:

$$
f(x, Z) = \bm{W}(Z) \, \bm{h}_{\text{zs}}(x).
$$

This leads to the following optimization problem:

$$
\min_{\bm{W}} \; \mathbb{E}_{x \sim \mathcal{D}}\Big[ \big\| \bm{h}_{\text{icl}}(x, Z) - \bm{h}_{\text{zs}}(x) - \bm{W} \bm{h}_{\text{zs}}(x) \big\|_2^2 \Big]. \tag{14}
$$

This reformulated problem in equation [14](https://arxiv.org/html/2605.20730#S5.E14) has a closed-form solution, as described below.

### 5.2 Formal Description of Proposed Method

We now formalize Linear Task Vector (LTV), a task vector extraction method $f$ which estimates a linear mapping $\bm{W}^{\star}$ that minimizes equation [14](https://arxiv.org/html/2605.20730#S5.E14). Specifically, in the extraction stage, LTV estimates $\bm{W}^{\star}$ via ridge regression [[17](https://arxiv.org/html/2605.20730#bib.bib79)]. In the inference stage, the estimated optimal mapping $\bm{W}^{\star}$ is applied to compute the task vector.

##### Extraction Stage\.

Given demonstrations $Z$, we estimate $\bm{W} \in \mathbb{R}^{d \times d}$ by solving a regression problem mapping $\bm{h}_{\text{zs}}$ to $(\bm{h}_{\text{icl}} - \bm{h}_{\text{zs}})$, where $d$ is the dimension of the hidden state. As shown in the leftmost column of Fig. [4](https://arxiv.org/html/2605.20730#S5.F4), we first sample $N$ unlabeled queries $\{x_j\}_{j=1}^{N}$ from the training set, and use them to compute the hidden states $h$. Let $\bm{H} \in \mathbb{R}^{d \times N}$ be the matrix whose $j$-th column is $\bm{h}_{\text{zs}}(x_j)$, and let $\bm{Y} \in \mathbb{R}^{d \times N}$ be the matrix whose $j$-th column is $(\bm{h}_{\text{icl}}(x_j) - \bm{h}_{\text{zs}}(x_j))$. To prevent ill-conditioning of $\bm{H}\bm{H}^{\top}$ when $d$ is large relative to $N$, we employ the ridge regression

$$
\bm{W}^{\star} = \underset{\bm{W}}{\operatorname{argmin}} \left( \|\bm{Y} - \bm{W}\bm{H}\|_F^2 + \lambda \|\bm{W}\|_F^2 \right), \tag{15}
$$

where $\lambda$ is the regularization parameter, and $\|\cdot\|_F$ denotes the Frobenius norm. This yields a closed-form solution, computed by solving a linear system

$$
(\bm{H}\bm{H}^{\top} + \lambda \bm{I}) \, \bm{W}^{\star\top} = \bm{H}\bm{Y}^{\top}, \tag{16}
$$

where $\bm{I}$ denotes the identity matrix.
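The extraction stage can be sketched in a few lines of NumPy. This is a minimal illustration of the ridge solve in equations 15 and 16, not the released code: the function name `fit_ltv` is hypothetical, and the hidden-state matrices are synthetic stand-ins generated from a known linear map so that the recovered solution can be checked.

```python
import numpy as np

def fit_ltv(H, Y, lam=5.0):
    """Solve (H H^T + lam I) W^T = H Y^T for W.

    H: (d, N) matrix whose columns are zero-shot hidden states h_zs(x_j).
    Y: (d, N) matrix whose columns are h_icl(x_j) - h_zs(x_j).
    Returns W of shape (d, d).
    """
    d = H.shape[0]
    A = H @ H.T + lam * np.eye(d)      # (d, d), positive definite for lam > 0
    W_T = np.linalg.solve(A, H @ Y.T)  # solve the linear system instead of inverting A
    return W_T.T

# Synthetic check (assumption): targets generated by a known linear map.
rng = np.random.default_rng(0)
d, N = 8, 256
W_true = rng.normal(size=(d, d))
H = rng.normal(size=(d, N))
Y = W_true @ H

# With a negligible penalty, ridge regression recovers the generating map.
W_star = fit_ltv(H, Y, lam=1e-8)
```

Solving the linear system with `np.linalg.solve` avoids forming an explicit inverse, and the $\lambda \bm{I}$ term keeps the system solvable even when $N < d$ and $\bm{H}\bm{H}^{\top}$ is rank-deficient.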

##### Inference Stage\.

As shown in the rightmost column of Fig. [4](https://arxiv.org/html/2605.20730#S5.F4), we compute the task vector for a test query $x_{\text{test}}$ as $\bm{v} = \bm{W}^{\star} \bm{h}_{\text{zs}}(x_{\text{test}})$. The task-conditioned hidden state is then obtained as $\bm{h}_{\text{tv}} = \bm{h}_{\text{zs}} + \bm{v}$, from which the LM head computes the predictive distribution $P_{\text{tv}}$.
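The inference stage amounts to one matrix-vector product followed by the usual LM head. The sketch below illustrates this under simplifying assumptions: the LM head is applied directly to $\bm{h}_{\text{tv}}$ (any final normalization layer in a real model is omitted), all arrays are synthetic, and `ltv_predict` is a hypothetical helper name.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax for a 1-D logit vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ltv_predict(h_zs, W_star, W_lm):
    """Return P_tv for one test query.

    h_zs: (d,) zero-shot hidden state of the test query.
    W_star: (d, d) linear map estimated in the extraction stage.
    W_lm: (V, d) LM head.
    """
    v = W_star @ h_zs            # task vector v = W* h_zs(x_test)
    h_tv = h_zs + v              # task-conditioned hidden state
    return softmax(W_lm @ h_tv)  # next-token distribution P_tv

# Synthetic stand-ins (assumption): real values would come from an LLM.
rng = np.random.default_rng(0)
d, V = 8, 5
W_lm = rng.normal(size=(V, d))
h_zs = rng.normal(size=d)

p_zero = ltv_predict(h_zs, np.zeros((d, d)), W_lm)  # W* = 0 reduces to zero-shot
p_tv = ltv_predict(h_zs, 0.5 * np.eye(d), W_lm)     # W* = 0.5 I scales h_zs by 1.5
```

Note that the task vector depends on the test query through $\bm{h}_{\text{zs}}(x_{\text{test}})$, so a zero map recovers the zero-shot prediction exactly.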

## 6 Experiments

In this section, we empirically validate the proposed LTV method. Sec. [6.1](https://arxiv.org/html/2605.20730#S6.SS1) describes the evaluation setup, and Sec. [6.2](https://arxiv.org/html/2605.20730#S6.SS2) presents the experimental results. Code is available at [this GitHub repository](https://github.com/Jii111/LTV).

### 6.1 Experimental Setup

##### Evaluation\.

Following prior work [[28](https://arxiv.org/html/2605.20730#bib.bib31), [12](https://arxiv.org/html/2605.20730#bib.bib62)], we evaluate LTV on a diverse set of text classification benchmarks. We consider eight widely used datasets: SST-2, SST-5 [[40](https://arxiv.org/html/2605.20730#bib.bib57)], MR [[36](https://arxiv.org/html/2605.20730#bib.bib55)], AGNews, DBPedia [[53](https://arxiv.org/html/2605.20730#bib.bib51)], TREC [[45](https://arxiv.org/html/2605.20730#bib.bib56)], SUBJ [[35](https://arxiv.org/html/2605.20730#bib.bib54)], and HateSpeech18 [[8](https://arxiv.org/html/2605.20730#bib.bib50)]. We report the average accuracy over five independent runs, each using a different set of demonstrations with $k = 30$. For LTV, we report results using $N = 256$ training queries and a regularization parameter $\lambda = 5$ selected via grid search. Further details on evaluation are provided in Appendix [A.3](https://arxiv.org/html/2605.20730#A1.SS3).

##### Models\.

We conduct experiments on five widely used LLMs: LLaMA\-2\-7B\[[42](https://arxiv.org/html/2605.20730#bib.bib28)\], LLaMA\-2\-13B\[[42](https://arxiv.org/html/2605.20730#bib.bib28)\], LLaMA\-3\.1\-8B\[[11](https://arxiv.org/html/2605.20730#bib.bib39)\], Qwen\-2\.5\-7B\[[38](https://arxiv.org/html/2605.20730#bib.bib40)\], and Qwen\-3\-8B\[[49](https://arxiv.org/html/2605.20730#bib.bib52)\]\.

##### Baselines\.

We compare LTV with existing task vector methods in a training-free setting, which can be categorized into three groups based on the module from which task vectors are extracted. First, *Task Vector* [[16](https://arxiv.org/html/2605.20730#bib.bib29)] extracts the task vector from the outputs of a single *decoder layer*. Second, *Function Vector* [[41](https://arxiv.org/html/2605.20730#bib.bib32)] and *State Vector* [[24](https://arxiv.org/html/2605.20730#bib.bib41)] extract task vectors from a subset of *attention heads*. Third, *I2CL* [[28](https://arxiv.org/html/2605.20730#bib.bib31)] extracts task vectors from both attention heads and MLP modules. In I2CL, the task vector is injected after scaling, where the scaling coefficient is *trained* using a validation set with true labels. For fair comparison in a training-free setting, we use the default coefficient selected in [[28](https://arxiv.org/html/2605.20730#bib.bib31)], instead of training the coefficient. We also compare with standard zero-shot and ICL baselines. Further details on each method are provided in Appendix [A.4](https://arxiv.org/html/2605.20730#A1.SS4).

Table 1: Classification accuracy (%) of LTV and four TV methods, tested on eight benchmarks. We also compare with two inference modes (zero-shot and ICL) for reference. LTV achieves the highest average accuracy (Avg.) across all models; full results on five LLMs are given in Appendix [A.7](https://arxiv.org/html/2605.20730#A1.SS7).

| Model | Method | AGNews | DBPedia | HateSpeech18 | MR | SST-2 | SST-5 | Subj | TREC | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-2-13B [[42](https://arxiv.org/html/2605.20730#bib.bib28)] | Zero-shot | 73.80 | 76.20 | 55.20 | 61.20 | 66.60 | 20.80 | 49.80 | 69.60 | 59.15 |
| | ICL | 88.04 (±1.7) | 96.88 (±1.1) | 77.40 (±2.2) | 94.44 (±0.5) | 94.80 (±0.7) | 46.52 (±2.7) | 82.28 (±7.1) | 83.36 (±5.2) | 82.97 |
| | Function Vector [[41](https://arxiv.org/html/2605.20730#bib.bib32)] | 74.00 (±1.2) | 76.40 (±7.5) | 54.08 (±0.4) | 61.84 (±3.4) | 72.44 (±6.0) | 21.40 (±1.2) | 50.00 (±0.0) | 69.96 (±1.1) | 60.02 |
| | Task Vector [[16](https://arxiv.org/html/2605.20730#bib.bib29)] | 79.84 (±2.9) | 80.72 (±1.9) | 71.48 (±6.7) | 83.16 (±0.8) | 82.80 (±4.2) | 33.64 (±2.1) | 49.72 (±0.1) | 73.84 (±5.3) | 69.40 |
| | State Vector [[24](https://arxiv.org/html/2605.20730#bib.bib41)] | 84.64 (±4.0) | 89.48 (±4.9) | 58.80 (±8.8) | 89.20 (±2.8) | 87.20 (±3.6) | 36.08 (±3.0) | 49.40 (±0.6) | 66.96 (±1.0) | 70.22 |
| | I2CL [[28](https://arxiv.org/html/2605.20730#bib.bib31)] | 78.88 (±0.3) | 79.00 (±0.5) | 54.48 (±0.1) | 60.68 (±0.3) | 64.80 (±0.4) | 25.96 (±0.1) | 50.00 (±0.0) | 69.12 (±0.4) | 60.37 |
| | LTV (Ours) | 86.68 (±1.7) | 93.20 (±0.5) | 72.08 (±2.2) | 90.20 (±2.0) | 88.96 (±2.1) | 41.88 (±2.2) | 78.76 (±2.7) | 83.92 (±6.3) | 79.46 |
| LLaMA-3.1-8B [[11](https://arxiv.org/html/2605.20730#bib.bib39)] | Zero-shot | 75.00 | 69.00 | 60.60 | 82.20 | 86.80 | 25.60 | 59.20 | 49.40 | 63.48 |
| | ICL | 87.16 (±1.2) | 97.68 (±0.8) | 74.32 (±8.1) | 94.52 (±0.4) | 94.20 (±1.1) | 48.32 (±1.1) | 85.96 (±5.3) | 79.08 (±7.3) | 82.66 |
| | Function Vector [[41](https://arxiv.org/html/2605.20730#bib.bib32)] | 76.16 (±0.1) | 69.44 (±1.0) | 63.00 (±0.2) | 83.24 (±0.3) | 87.32 (±0.6) | 26.20 (±1.3) | 59.52 (±0.2) | 69.12 (±0.6) | 66.75 |
| | Task Vector [[16](https://arxiv.org/html/2605.20730#bib.bib29)] | 81.36 (±3.7) | 83.24 (±1.5) | 67.64 (±2.0) | 83.48 (±3.5) | 87.88 (±0.3) | 34.76 (±1.9) | 61.12 (±1.4) | 66.48 (±1.0) | 70.75 |
| | State Vector [[24](https://arxiv.org/html/2605.20730#bib.bib41)] | 80.28 (±4.3) | 80.80 (±1.5) | 65.40 (±1.3) | 85.96 (±4.9) | 84.12 (±0.6) | 36.60 (±1.6) | 62.52 (±3.5) | 67.28 (±1.9) | 70.37 |
| | I2CL [[28](https://arxiv.org/html/2605.20730#bib.bib31)] | 76.76 (±0.4) | 72.56 (±0.4) | 62.24 (±0.5) | 85.24 (±0.3) | 90.80 (±0.2) | 32.48 (±0.4) | 62.28 (±0.1) | 49.00 (±0.5) | 66.42 |
| | LTV (Ours) | 82.84 (±2.5) | 93.36 (±1.4) | 70.44 (±6.0) | 88.68 (±1.0) | 90.08 (±1.2) | 38.20 (±3.4) | 67.16 (±11.4) | 72.88 (±7.2) | 75.46 |

![Refer to caption](https://arxiv.org/html/2605.20730v1/x5.png)

Figure 5: Comparison of $d_{\text{NTP}}$ across LTV and four baselines on eight benchmarks, tested on LLaMA-3.1-8B. LTV consistently achieves the lowest $d_{\text{NTP}}$ across all benchmarks.

Table 2: Extraction and inference latency of LTV and task vector baselines, evaluated on LLaMA-3.1-8B. Classification accuracy (%) is reported for reference.

| | Zero-shot | ICL | Function Vector | I2CL | State Vector | Task Vector | LTV (Ours) |
|---|---|---|---|---|---|---|---|
| Extraction (s) ↓ | - | - | 198.965 | 1.347 | 27.756 | 27.773 | 42.664 |
| Inference (s) ↓ | 0.0220 | 0.2238 | 0.0614 | 0.0246 | 0.0292 | 0.0236 | 0.0226 |
| Accuracy (%) ↑ | 63.48 | 82.66 | 66.75 | 66.42 | 70.37 | 70.75 | 75.46 |

Table 3: Effect of hyperparameters (the number of unlabeled queries $N$ and the regularization parameter $\lambda$) on the performance of LTV and $d_{\text{NTP}}$. Experiments are conducted on LLaMA-3.1-8B; default settings (gray columns in the original) are marked "(default)".

| | $N = 64$ | $N = 128$ | $N = 256$ (default) | $\lambda = 1.0$ | $\lambda = 5.0$ (default) | $\lambda = 10.0$ |
|---|---|---|---|---|---|---|
| Avg. Acc. (%) ↑ | 72.2 | 74.7 | 75.2 | 75.2 | 75.2 | 75.2 |
| $d_{\text{NTP}}$ ↓ | 0.229 | 0.178 | 0.147 | 0.149 | 0.147 | 0.145 |

Table 4: Regression performance of LTV and task vector baselines, measured by mean squared error (MSE), on linear regression and ReLU regression tasks. Experiments are conducted on LLaMA-3.1-8B. Zero-shot and ICL are shown for reference.

| | Zero-shot | ICL | Function Vector | I2CL | State Vector | Task Vector | LTV (Ours) |
|---|---|---|---|---|---|---|---|
| Linear Regression ↓ | 5.51 | 3.97 | 5.23 | 5.29 | 5.35 | 5.97 | 5.13 |
| ReLU Regression ↓ | 3.82 | 3.33 | 3.63 | 3.68 | 3.80 | 4.05 | 3.45 |

Table 5: Classification accuracy (%) of LTV when task vectors obtained from a large model are transferred and applied to a small model for task vector-based inference. We use Qwen-2.5-72B as the large model and Qwen-2.5-7B as the small model. As a reference, we report zero-shot, ICL, and LTV inference of each model using its own task vectors. Values in parentheses indicate the change relative to the LTV performance of the small model.

| Model | Method | AGNews | DBPedia | HateSpeech18 | MR | SST-2 | SST-5 | Subj | TREC | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Small Model (Qwen-2.5-7B [[38](https://arxiv.org/html/2605.20730#bib.bib40)]) | Zero-shot | 74.20 | 73.20 | 61.20 | 62.20 | 55.40 | 20.00 | 56.80 | 67.20 | 58.78 |
| | ICL | 83.24 | 83.48 | 80.60 | 92.88 | 95.00 | 46.60 | 76.00 | 87.04 | 80.60 |
| | LTV | 78.48 | 77.24 | 72.28 | 87.56 | 89.00 | 33.16 | 66.52 | 84.32 | 73.57 |
| Large Model (Qwen-2.5-72B [[38](https://arxiv.org/html/2605.20730#bib.bib40)]) | Zero-shot | 72.00 | 68.00 | 70.40 | 92.00 | 90.60 | 36.00 | 53.00 | 73.00 | 69.38 |
| | ICL | 88.08 | 95.64 | 82.80 | 93.88 | 96.44 | 46.12 | 95.80 | 75.44 | 84.25 |
| | LTV | 81.40 | 88.92 | 77.20 | 91.48 | 94.36 | 37.16 | 91.92 | 77.24 | 79.96 |
| Transferred | LTV (72B → 7B) | 83.24 (+4.76) | 92.68 (+15.44) | 76.48 (+4.20) | 91.36 (+3.80) | 92.36 (+3.36) | 42.48 (+9.32) | 90.24 (+23.72) | 71.16 (-13.16) | 80.00 (+6.43) |

### 6.2 Results

##### LTV outperforms baselines.

Table [1](https://arxiv.org/html/2605.20730#S6.T1) reports classification accuracy of LTV and baseline task vector methods across eight benchmarks. We present results on two representative models here, with full results on all five LLMs provided in Appendix [A.7](https://arxiv.org/html/2605.20730#A1.SS7). As shown in the rightmost column, LTV achieves the highest average accuracy on all five models. Notably, on LLaMA-2-13B, LTV outperforms the best baseline (*State Vector*) by 9.2 percentage points, achieving an average accuracy of 79.46% compared to 70.22%. Moreover, LTV ranks first on the majority of individual tasks across all models; specifically, our method achieves the best performance in 31 out of 40 cases (8 benchmarks on 5 models). This strong performance highlights that, among TV methods, LTV most effectively extracts task vectors that capture the effect of demonstrations.

##### LTV effectively reduces the proposed metric.

Recall that LTV is designed to achieve a small $d_{\text{NTP}}$. Given its strong performance over baselines, a natural follow-up is whether LTV indeed attains lower $d_{\text{NTP}}$. Fig. [5](https://arxiv.org/html/2605.20730#S6.F5) compares $d_{\text{NTP}}$ across five methods on eight benchmarks, evaluated with the LLaMA-3.1-8B model. Compared to the baselines, LTV achieves the lowest $d_{\text{NTP}}$ on all benchmarks. This confirms that LTV works as intended, successfully reducing the discrepancy between TV and ICL modes.

##### LTV incurs the lowest inference overhead.

We evaluate the computational efficiency of LTV in terms of one-time extraction cost and per-sample inference latency (Table [2](https://arxiv.org/html/2605.20730#S6.T2)). Although LTV requires a higher extraction time (42.664s) than I2CL, State Vector, and Task Vector, this overhead is a one-time pre-computation cost that is amortized over the entire test set. More importantly, LTV achieves the lowest inference latency (0.0226s per sample) among all considered vector-based methods, nearly matching the speed of zero-shot inference (0.0220s). This efficiency stems from the fact that LTV operates solely on the final layer through a single matrix-vector multiplication, providing a significant accuracy boost (75.46%, versus 63.48% for zero-shot) without compromising the model's original inference throughput.
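As a rough illustration of the amortization argument, the following sketch computes how many test queries are needed before LTV's one-time extraction cost is offset by its faster per-query inference relative to ICL. The latencies are the LLaMA-3.1-8B figures from Table 2. The cost model itself is an assumption made here: constant per-query latency, no one-time cost for ICL, and extraction and inference timings that are directly comparable.

```python
import math

# Latencies from Table 2 (LLaMA-3.1-8B), in seconds.
extraction_s = 42.664     # LTV one-time extraction cost
ltv_per_query_s = 0.0226  # LTV inference latency per sample
icl_per_query_s = 0.2238  # ICL inference latency per sample

def total_cost(n_queries, per_query_s, one_time_s=0.0):
    # Simple linear cost model (assumption): one-time cost plus constant per-query latency.
    return one_time_s + n_queries * per_query_s

# Smallest number of queries at which LTV's total time drops below ICL's.
saving_per_query = icl_per_query_s - ltv_per_query_s
break_even = math.ceil(extraction_s / saving_per_query)  # 213 under this cost model
```

Under these assumptions the extraction cost is recovered after a couple of hundred queries, which is small compared with the size of typical test sets.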

##### LTV is robust to hyperparameter choices.

One might wonder whether the performance of LTV relies on careful hyperparameter tuning. We examine the effect of the number of queries $N$ and the ridge regularization parameter $\lambda$ on both accuracy and $d_{\text{NTP}}$. Table [3](https://arxiv.org/html/2605.20730#S6.T3) reports results obtained by varying one hyperparameter at a time. Halving $N$ from the default value of 256 to 128 decreases accuracy by only 0.5% with a slight increase in $d_{\text{NTP}}$. Similarly, varying $\lambda$ from the default value of 5.0 to 1.0 or 10.0 maintains comparable accuracy and $d_{\text{NTP}}$. These results confirm that LTV is robust to hyperparameter choices.

##### LTV also remains effective on regression tasks.

Most prior studies on task vectors have focused exclusively on classification, leaving it unclear whether existing TV methods are effective in settings with continuous outputs. We address this gap by evaluating TV methods on regression tasks, following the experimental setup of prior studies [[50](https://arxiv.org/html/2605.20730#bib.bib34), [13](https://arxiv.org/html/2605.20730#bib.bib85)]. Specifically, we conduct linear regression and ReLU [[1](https://arxiv.org/html/2605.20730#bib.bib78)] regression tasks, and report the mean squared error (MSE) of LTV and task vector baselines in Table [4](https://arxiv.org/html/2605.20730#S6.T4). LTV achieves the lowest MSE among all task vector baselines on both tasks (5.13 on linear regression; 3.45 on ReLU regression), showing that our approach is also effective on regression tasks. Detailed setups for the regression experiments are provided in Appendix [A.5](https://arxiv.org/html/2605.20730#A1.SS5).

##### LTV can transfer task vectors across model scales.

One might be concerned whether task vector methods remain useful even when the inference model cannot perform ICL effectively. Such cases arise, for instance, when the context length is too short to fit enough demonstrations or when the inference model has limited ICL capability. LTV can address this concern through cross-model transfer, which is enabled by a simple extension (see Appendix [A.6](https://arxiv.org/html/2605.20730#A1.SS6) for details). This transfer allows us to extract task vectors from a larger, more capable model and apply them to a smaller one. Notably, as shown in Table 5, transferring LTV from the 72B model to the 7B model improves the average classification accuracy of the small model by 6.43 percentage points (from 73.57% to 80.00%), even matching the LTV accuracy of the 72B model (79.96%). This highlights the practicality of LTV even when the inference model is constrained.

## 7 Conclusion

In this paper, we investigated the mechanics of task vectors (TVs) as a means of compressing In-Context Learning (ICL) demonstrations into the hidden states of LLMs. We introduced $d_{\text{NTP}}$, a metric that quantifies the discrepancy in next-token probability between TV-based and standard ICL-based inference. Our analysis reveals that $d_{\text{NTP}}$ serves as a proxy for downstream performance, enabling the evaluation of TV methods without the need for exhaustive task-specific testing. Leveraging these insights, we developed Linear Task Vector (LTV), a method that employs a linear mapping to minimize $d_{\text{NTP}}$. Experimental results across multiple benchmarks and architectures demonstrate that LTV outperforms existing training-free TV methods. By achieving superior accuracy with reduced inference latency, LTV offers a more efficient and effective framework for task compression in LLMs.

## References

- [1] A. F. Agarap (2018). Deep learning using rectified linear units (relu). arXiv preprint arXiv:1803.08375.
- [2] R. Agarwal, A. Singh, L. Zhang, B. Bohnet, L. Rosias, S. Chan, B. Zhang, A. Anand, Z. Abbas, A. Nova, et al. (2024). Many-shot in-context learning. Advances in Neural Information Processing Systems 37, pp. 76930–76966.
- [3] K. Ahn, X. Cheng, H. Daneshmand, and S. Sra (2023). Transformers learn to implement preconditioned gradient descent for in-context learning. Advances in Neural Information Processing Systems 36, pp. 45614–45650.
- [4] E. Akyürek, D. Schuurmans, J. Andreas, T. Ma, and D. Zhou (2022). What learning algorithm is in-context learning? investigations with linear models. arXiv preprint arXiv:2211.15661.
- [5] R. Belanec, S. Ostermann, I. Srba, and M. Bielikova (2025). Task prompt vectors: effective initialization through multi-task soft prompt transfer. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 77–94.
- [6] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020). Language models are few-shot learners. Advances in neural information processing systems 33, pp. 1877–1901.
- [7] A. Chatterjee, K. N. Narahari, M. Joshi, and P. Agrawal (2019). SemEval-2019 task 3: emocontext contextual emotion detection in text. In Proceedings of the 13th international workshop on semantic evaluation, pp. 39–48.
- [8] O. De Gibert, N. Perez, A. García-Pablos, and M. Cuadros (2018). Hate speech dataset from a white supremacy forum. arXiv preprint arXiv:1809.04444.
- [9] Q. Dong, L. Li, D. Dai, C. Zheng, J. Ma, R. Li, H. Xia, J. Xu, Z. Wu, B. Chang, et al. (2024). A survey on in-context learning. In Proceedings of the 2024 conference on empirical methods in natural language processing, pp. 1107–1128.
- [10] Y. Dong, J. Jiang, Z. Zhu, and X. Ning (2025). Understanding task vectors in in-context learning: emergence, functionality, and limitations. arXiv preprint arXiv:2506.09048.
- [11] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. (2024). The llama 3 herd of models. arXiv e-prints, pp. arXiv–2407.
- [12] B. Gao, X. Wang, Y. Yang, and D. A. Clifton (2025). Optimization inspired few-shot adaptation for large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. External Links: [Link](https://openreview.net/forum?id=rZ2nSt1X58)
- [13] S. Garg, D. Tsipras, P. S. Liang, and G. Valiant (2022). What can transformers learn in-context? a case study of simple function classes. Advances in neural information processing systems 35, pp. 30583–30598.
- [14] S. Golchin, Y. Chen, R. Han, M. Gandhi, T. Yu, S. Mishra, M. Surdeanu, R. Agarwal, C. Lee, and T. Pfister (2025). Towards compute-optimal many-shot in-context learning. In Second Conference on Language Modeling.
- [15] S. Han, J. Song, J. Gore, and P. Agrawal (2025). Emergence and effectiveness of task vectors in in-context learning: an encoder decoder perspective. In Forty-second International Conference on Machine Learning. External Links: [Link](https://openreview.net/forum?id=0ysC6VS0y3)
- [16] R. Hendel, M. Geva, and A. Globerson (2023). In-context learning creates task vectors. In The 2023 Conference on Empirical Methods in Natural Language Processing.
- [17] A. E. Hoerl and R. W. Kennard (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12(1), pp. 55–67.
- [18] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2022). LoRA: low-rank adaptation of large language models. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=nZeVKeeFYf9)
- [19] G. Ilharco, M. T. Ribeiro, M. Wortsman, L. Schmidt, H. Hajishirzi, and A. Farhadi (2023). Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations.
- [20] J. Jukić and J. Šnajder (2024). Disentangling latent shifts of in-context learning with weak supervision. arXiv preprint arXiv:2410.01508.
- [21] J. Kang, S. Lee, S. Park, S. Park, T. Kim, J. Kim, R. Lee, and K. Song (2025). Adaptive task vectors for large language models. arXiv preprint arXiv:2506.03426.
- [22] S. Kullback and R. A. Leibler (1951). On information and sufficiency. The annals of mathematical statistics 22(1), pp. 79–86.
- [23] B. Lester, R. Al-Rfou, and N. Constant (2021). The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 conference on empirical methods in natural language processing, pp. 3045–3059.
- [24] D. Li, Z. Liu, X. Hu, Z. Sun, B. Hu, and M. Zhang (2024). In-context learning state vector with inner and momentum optimization. Advances in Neural Information Processing Systems 37, pp. 7797–7820.
- [25] H. Li, Y. Zhang, S. Zhang, M. Wang, S. Liu, and P. Chen (2025). When is task vector provably effective for model editing? a generalization analysis of nonlinear transformers. arXiv preprint arXiv:2504.10957.
- [26] J. Li, Y. Li, L. Han, R. Tang, and W. Wang (2025). Towards generalizable implicit in-context learning with attention routing. arXiv preprint arXiv:2509.22854.
- \[27\]Y\. Li, M\. E\. Ildiz, D\. Papailiopoulos, and S\. Oymak\(2023\)Transformers as algorithms: generalization and stability in in\-context learning\.InInternational conference on machine learning,pp\. 19565–19594\.Cited by:[§2](https://arxiv.org/html/2605.20730#S2.SS0.SSS0.Px2.p1.1)\.
- \[28\]Z\. Li, Z\. Xu, L\. Han, Y\. Gao, S\. Wen, D\. Liu, H\. Wang, and D\. N\. Metaxas\(2025\)Implicit in\-context learning\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=G7u4ue6ncT)Cited by:[4th item](https://arxiv.org/html/2605.20730#A1.I1.i4.p1.1),[§A\.3](https://arxiv.org/html/2605.20730#A1.SS3.p1.1),[Table 10](https://arxiv.org/html/2605.20730#A1.T10.5.1.7.1.1),[Table 7](https://arxiv.org/html/2605.20730#A1.T7.136.136.136.9.2),[Table 7](https://arxiv.org/html/2605.20730#A1.T7.184.184.184.9.2),[Table 7](https://arxiv.org/html/2605.20730#A1.T7.232.232.232.9.2),[Table 7](https://arxiv.org/html/2605.20730#A1.T7.40.40.40.9.2),[Table 7](https://arxiv.org/html/2605.20730#A1.T7.88.88.88.9.2),[Table 9](https://arxiv.org/html/2605.20730#A1.T9.5.1.7.1.1),[§1](https://arxiv.org/html/2605.20730#S1.p1.1),[§2](https://arxiv.org/html/2605.20730#S2.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2605.20730#S2.SS0.SSS0.Px3.p2.1),[§3\.2](https://arxiv.org/html/2605.20730#S3.SS2.p1.1),[Figure 3](https://arxiv.org/html/2605.20730#S4.F3),[Figure 3](https://arxiv.org/html/2605.20730#S4.F3.16.8),[§4\.3](https://arxiv.org/html/2605.20730#S4.SS3.SSS0.Px1.p1.5),[§6\.1](https://arxiv.org/html/2605.20730#S6.SS1.SSS0.Px1.p1.3),[§6\.1](https://arxiv.org/html/2605.20730#S6.SS1.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2605.20730#S6.T1.40.40.40.9.2),[Table 1](https://arxiv.org/html/2605.20730#S6.T1.88.88.88.9.2)\.
- \[29\]S\. Liu, H\. Ye, L\. Xing, and J\. Y\. Zou\(2024\)In\-context vectors: making in context learning more effective and controllable through latent space steering\.InForty\-first International Conference on Machine Learning,Cited by:[§1](https://arxiv.org/html/2605.20730#S1.p2.1),[§2](https://arxiv.org/html/2605.20730#S2.SS0.SSS0.Px3.p2.1)\.
- \[30\]I\. Loshchilov and F\. Hutter\(2017\)Decoupled weight decay regularization\.arXiv preprint arXiv:1711\.05101\.Cited by:[3rd item](https://arxiv.org/html/2605.20730#A1.I2.i3.p1.10)\.
- \[31\]S\. Mittal, E\. Elmoznino, L\. Gagnon, S\. Bhardwaj, G\. Lajoie, and D\. Sridhar\(2025\)Does learning the right latent variables necessarily improve in\-context learning?\.InForty\-second International Conference on Machine Learning,Cited by:[§2](https://arxiv.org/html/2605.20730#S2.SS0.SSS0.Px2.p1.1)\.
- \[32\]J\. Mu, X\. Li, and N\. Goodman\(2023\)Learning to compress prompts with gist tokens\.Advances in Neural Information Processing Systems36,pp\. 19327–19352\.Cited by:[§1](https://arxiv.org/html/2605.20730#S1.p1.1)\.
- \[33\]C\. Olsson, N\. Elhage, N\. Nanda, N\. Joseph, N\. DasSarma, T\. Henighan, B\. Mann, A\. Askell, Y\. Bai, A\. Chen,et al\.\(2022\)In\-context learning and induction heads\.arXiv preprint arXiv:2209\.11895\.Cited by:[§2](https://arxiv.org/html/2605.20730#S2.SS0.SSS0.Px2.p1.1)\.
- \[34\]G\. Ortiz\-Jimenez, A\. Favero, and P\. Frossard\(2023\)Task arithmetic in the tangent space: improved editing of pre\-trained models\.Advances in Neural Information Processing Systems36,pp\. 66727–66754\.Cited by:[§2](https://arxiv.org/html/2605.20730#S2.SS0.SSS0.Px1.p1.1)\.
- \[35\]B\. Pang and L\. Lee\(2004\)A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts\.arXiv preprint cs/0409058\.Cited by:[§4\.3](https://arxiv.org/html/2605.20730#S4.SS3.SSS0.Px1.p1.5),[§6\.1](https://arxiv.org/html/2605.20730#S6.SS1.SSS0.Px1.p1.3)\.
- \[36\]B\. Pang and L\. Lee\(2005\)Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales\.arXiv preprint cs/0506075\.Cited by:[§6\.1](https://arxiv.org/html/2605.20730#S6.SS1.SSS0.Px1.p1.3)\.
- \[37\]M\. Panwar, K\. Ahuja, and N\. Goyal\(2023\)In\-context learning through the bayesian prism\.arXiv preprint arXiv:2306\.04891\.Cited by:[§2](https://arxiv.org/html/2605.20730#S2.SS0.SSS0.Px2.p1.1)\.
- \[38\]A\. Y\. Qwen, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei,et al\.\(2024\)Qwen2\. 5 technical report\.arXiv preprint\.Cited by:[Table 7](https://arxiv.org/html/2605.20730#A1.T7.240.240.242.1.1.1.2),[§4\.3](https://arxiv.org/html/2605.20730#S4.SS3.SSS0.Px1.p1.5),[§6\.1](https://arxiv.org/html/2605.20730#S6.SS1.SSS0.Px2.p1.1),[§6\.1](https://arxiv.org/html/2605.20730#S6.SS1.SSS0.Px3.1.1.1.3.1.1.1.2.1),[§6\.1](https://arxiv.org/html/2605.20730#S6.SS1.SSS0.Px3.1.1.1.6.1.1.1.2.1)\.
- \[39\]B\. Saglam, X\. Hu, Z\. Yang, D\. Kalogerias, and A\. Karbasi\(2025\)Learning task representations from in\-context learning\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 6634–6663\.Cited by:[§3\.2](https://arxiv.org/html/2605.20730#S3.SS2.p1.1)\.
- \[40\]R\. Socher, A\. Perelygin, J\. Wu, J\. Chuang, C\. D\. Manning, A\. Y\. Ng, and C\. Potts\(2013\)Recursive deep models for semantic compositionality over a sentiment treebank\.InProceedings of the 2013 conference on empirical methods in natural language processing,pp\. 1631–1642\.Cited by:[§4\.3](https://arxiv.org/html/2605.20730#S4.SS3.SSS0.Px1.p1.5),[§6\.1](https://arxiv.org/html/2605.20730#S6.SS1.SSS0.Px1.p1.3)\.
- \[41\]E\. Todd, M\. Li, A\. S\. Sharma, A\. Mueller, B\. C\. Wallace, and D\. Bau\(2024\)Function vectors in large language models\.InThe Twelfth International Conference on Learning Representations,Cited by:[2nd item](https://arxiv.org/html/2605.20730#A1.I1.i2.p1.1),[§A\.3](https://arxiv.org/html/2605.20730#A1.SS3.p1.1),[Table 10](https://arxiv.org/html/2605.20730#A1.T10.5.1.4.1.1),[Table 7](https://arxiv.org/html/2605.20730#A1.T7.112.112.112.9.2),[Table 7](https://arxiv.org/html/2605.20730#A1.T7.16.16.16.9.2),[Table 7](https://arxiv.org/html/2605.20730#A1.T7.160.160.160.9.2),[Table 7](https://arxiv.org/html/2605.20730#A1.T7.208.208.208.9.2),[Table 7](https://arxiv.org/html/2605.20730#A1.T7.64.64.64.9.2),[Table 9](https://arxiv.org/html/2605.20730#A1.T9.5.1.4.1.1),[§1](https://arxiv.org/html/2605.20730#S1.p2.1),[§2](https://arxiv.org/html/2605.20730#S2.SS0.SSS0.Px3.p2.1),[Figure 3](https://arxiv.org/html/2605.20730#S4.F3),[Figure 3](https://arxiv.org/html/2605.20730#S4.F3.16.8),[§4\.3](https://arxiv.org/html/2605.20730#S4.SS3.SSS0.Px1.p1.5),[§6\.1](https://arxiv.org/html/2605.20730#S6.SS1.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2605.20730#S6.T1.16.16.16.9.2),[Table 1](https://arxiv.org/html/2605.20730#S6.T1.64.64.64.9.2)\.
- \[42\]H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, S\. Bhosale,et al\.\(2023\)Llama 2: open foundation and fine\-tuned chat models\.arXiv preprint arXiv:2307\.09288\.Cited by:[Table 7](https://arxiv.org/html/2605.20730#A1.T7.240.240.244.1.1.1.2),[Table 7](https://arxiv.org/html/2605.20730#A1.T7.240.240.245.1.1.1.2),[§6\.1](https://arxiv.org/html/2605.20730#S6.SS1.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2605.20730#S6.T1.96.96.98.1.1.1.2)\.
- \[43\]A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin\(2017\)Attention is all you need\.Advances in neural information processing systems30\.Cited by:[2nd item](https://arxiv.org/html/2605.20730#S3.I1.i2.p1.1)\.
- \[44\]J\. Von Oswald, E\. Niklasson, E\. Randazzo, J\. Sacramento, A\. Mordvintsev, A\. Zhmoginov, and M\. Vladymyrov\(2023\)Transformers learn in\-context by gradient descent\.InInternational Conference on Machine Learning,pp\. 35151–35174\.Cited by:[§2](https://arxiv.org/html/2605.20730#S2.SS0.SSS0.Px2.p1.1)\.
- \[45\]E\. M\. Voorhees and D\. M\. Tice\(2000\)Building a question answering test collection\.InProceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval,pp\. 200–207\.Cited by:[§6\.1](https://arxiv.org/html/2605.20730#S6.SS1.SSS0.Px1.p1.3)\.
- \[46\]F\. Wang, J\. Yan, Y\. Zhang, and T\. Lin\(2025\)ELICIT: llm augmentation via external in\-context capability\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.20730#S2.SS0.SSS0.Px3.p2.1)\.
- \[47\]J\. Wei, Y\. Tay, R\. Bommasani, C\. Raffel, B\. Zoph, S\. Borgeaud, D\. Yogatama, M\. Bosma, D\. Zhou, D\. Metzler,et al\.\(2022\)Emergent abilities of large language models\.arXiv preprint arXiv:2206\.07682\.Cited by:[§1](https://arxiv.org/html/2605.20730#S1.p1.1)\.
- \[48\]S\. M\. Xie, A\. Raghunathan, P\. Liang, and T\. Ma\(2021\)An explanation of in\-context learning as implicit bayesian inference\.arXiv preprint arXiv:2111\.02080\.Cited by:[§2](https://arxiv.org/html/2605.20730#S2.SS0.SSS0.Px2.p1.1),[§3\.3\.3](https://arxiv.org/html/2605.20730#S3.SS3.SSS3.p1.3)\.
- \[49\]A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[Table 7](https://arxiv.org/html/2605.20730#A1.T7.240.240.243.1.1.1.2),[§6\.1](https://arxiv.org/html/2605.20730#S6.SS1.SSS0.Px2.p1.1)\.
- \[50\]L\. Yang, Z\. Lin, K\. Lee, D\. Papailiopoulos, and R\. D\. Nowak\(2025\)Task vectors in in\-context learning: emergence, formation, and benefit\.CoRR\.Cited by:[§A\.5](https://arxiv.org/html/2605.20730#A1.SS5.p1.3),[§1](https://arxiv.org/html/2605.20730#S1.p2.1),[§2](https://arxiv.org/html/2605.20730#S2.SS0.SSS0.Px1.p1.1),[§6\.2](https://arxiv.org/html/2605.20730#S6.SS2.SSS0.Px5.p1.1)\.
- \[51\]K\. Yin and J\. Steinhardt\(2025\)Which attention heads matter for in\-context learning?\.arXiv preprint arXiv:2502\.14010\.Cited by:[§2](https://arxiv.org/html/2605.20730#S2.SS0.SSS0.Px2.p1.1)\.
- \[52\]F\. Z\. Zhang, P\. Albert, C\. Rodriguez\-Opazo, A\. van den Hengel, and E\. Abbasnejad\(2024\)Knowledge composition using task vectors with learned anisotropic scaling\.Advances in Neural Information Processing Systems37,pp\. 67319–67354\.Cited by:[§2](https://arxiv.org/html/2605.20730#S2.SS0.SSS0.Px1.p1.1)\.
- \[53\]X\. Zhang, J\. Zhao, and Y\. LeCun\(2015\)Character\-level convolutional networks for text classification\.Advances in neural information processing systems28\.Cited by:[§4\.3](https://arxiv.org/html/2605.20730#S4.SS3.SSS0.Px1.p1.5),[§6\.1](https://arxiv.org/html/2605.20730#S6.SS1.SSS0.Px1.p1.3)\.
- \[54\]Y\. Zhang, F\. Zhang, Z\. Yang, and Z\. Wang\(2023\)What and how does in\-context learning learn? bayesian model averaging, parameterization, and generalization\.arXiv preprint arXiv:2305\.19420\.Cited by:[§2](https://arxiv.org/html/2605.20730#S2.SS0.SSS0.Px2.p1.1)\.
- \[55\]Y\. Zhou, J\. Li, Y\. Xiang, H\. Yan, L\. Gui, and Y\. He\(2024\)The mystery of in\-context learning: a comprehensive survey on interpretation and analysis\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 14365–14378\.Cited by:[§2](https://arxiv.org/html/2605.20730#S2.SS0.SSS0.Px2.p1.1)\.

## Appendix A

### A.1 Limitations and Future Work

We describe several limitations of our work along with corresponding directions for future work. First, we adopt a linear mapping for LTV and empirically validate its effectiveness across a wide range of benchmarks and models. At the same time, our experiments in Appendix [A.7](https://arxiv.org/html/2605.20730#A1.SS7) show that more expressive mappings can further improve downstream accuracy, suggesting clear potential beyond the linear case. Designing more expressive mappings that retain the efficiency of a closed-form solution is therefore a promising direction for future work.

Second, while our work investigates *why* a linear mapping is a reasonable choice and *how* effective it is, the question of *when* the linear approximation is effective remains unexplored. A deeper theoretical characterization of these conditions would not only strengthen the foundation of our work, but also guide the design of mappings tailored to different settings.

Third, our experiments primarily focus on tasks with relatively short outputs, such as single-token classification and scalar regression. Extending LTV to inference tasks that generate longer and more diverse token sequences (such as multi-token reasoning, multi-turn conversation, or summarization) would broaden its applicability and is a promising direction for future work.

Finally, although we demonstrate in Sec. [6.2](https://arxiv.org/html/2605.20730#S6.SS2) that LTV enables task vectors to be transferred from a larger model to a smaller one, this transfer currently relies on the assumption that the source and target models share the same tokenizer, which serves as a common coordinate system for the logit-space regression. Extending LTV to enable transfer across models with different tokenizers would substantially broaden its applicability and is another valuable direction for future work.

### A.2 Proof of Proposition [5.1](https://arxiv.org/html/2605.20730#S5.Thmtheorem1)

We provide a detailed proof of Proposition [5.1](https://arxiv.org/html/2605.20730#S5.Thmtheorem1) presented in Sec. [5](https://arxiv.org/html/2605.20730#S5).

###### Proof\.

Recall that, throughout the paper, $P$ denotes the next-token distribution *restricted* to the label set $\mathcal{C}$ as defined in equation [2](https://arxiv.org/html/2605.20730#S3.E2). Accordingly, the $P_{\text{icl}}$ and $P_{\text{tv}}$ in the KL divergence in $d_{\text{NTP}}(f)$ in equation [9](https://arxiv.org/html/2605.20730#S4.E9) are also computed over $\mathcal{C}$.

Given demonstrations $Z$ and a query $x$, recall that

$$\bm{h}_{\text{icl}} \coloneqq \bm{h}_{\text{icl}}(x, Z), \qquad \bm{h}_{\text{tv}} \coloneqq \bm{h}_{\text{tv}}(x, \bm{v})$$

denote the hidden states of the last token in the final layer under ICL and TV inference, respectively.

*Restricted logits.* For the LM head $\bm{W}_{\text{lm}}$, define $\bm{W}_{\mathcal{C}} \in \mathbb{R}^{K \times d}$ as the submatrix of $\bm{W}_{\text{lm}}$ formed by selecting the rows indexed by the label set $\mathcal{C}$ (where $K = |\mathcal{C}|$). Then

$$\|\bm{W}_{\mathcal{C}}\|_{2} \leq \|\bm{W}_{\text{lm}}\|_{2} \leq C_{1},$$

since removing rows cannot increase the spectral norm.

Define the *restricted* logits

$$\bm{z} \coloneqq \bm{W}_{\mathcal{C}}\,\bm{h}_{\text{icl}}, \qquad \tilde{\bm{z}} \coloneqq \bm{W}_{\mathcal{C}}\,\bm{h}_{\text{tv}},$$

and the corresponding restricted predictive distributions

$$\bm{p} \coloneqq \mathrm{softmax}(\bm{z}), \qquad \bm{q} \coloneqq \mathrm{softmax}(\tilde{\bm{z}}).$$

Then the discrepancy for each query $x$ appearing in $d_{\text{NTP}}(f)$ in equation [9](https://arxiv.org/html/2605.20730#S4.E9) is equal to $D_{\mathrm{KL}}(\bm{p} \,\|\, \bm{q})$.

*Step 1: Bounding the restricted logit distance.* By sub-multiplicativity and the spectral-norm bound,

$$\|\bm{z} - \tilde{\bm{z}}\|_{2} = \|\bm{W}_{\mathcal{C}}(\bm{h}_{\text{icl}} - \bm{h}_{\text{tv}})\|_{2} \leq \|\bm{W}_{\mathcal{C}}\|_{2}\,\|\bm{h}_{\text{icl}} - \bm{h}_{\text{tv}}\|_{2} \leq C_{1}\,\|\bm{h}_{\text{icl}} - \bm{h}_{\text{tv}}\|_{2}. \tag{17}$$

*Step 2: Bounding the restricted KL divergence via Lipschitz log-softmax.* Let $\phi(\cdot) \coloneqq \log \mathrm{softmax}(\cdot)$ denote the log-softmax map on $\mathbb{R}^{K}$. By assumption, $\phi$ is $C_{2}$-Lipschitz in $\ell_{2}$:

$$\|\phi(\bm{z}) - \phi(\tilde{\bm{z}})\|_{2} \leq C_{2}\,\|\bm{z} - \tilde{\bm{z}}\|_{2}. \tag{18}$$

For any coordinate $i \in \{1, \dots, K\}$,

$$|\log \bm{p}_{i} - \log \bm{q}_{i}| = |\phi(\bm{z})_{i} - \phi(\tilde{\bm{z}})_{i}| \leq \|\phi(\bm{z}) - \phi(\tilde{\bm{z}})\|_{2} \leq C_{2}\,\|\bm{z} - \tilde{\bm{z}}\|_{2}.$$

Therefore,

$$\begin{aligned}
D_{\mathrm{KL}}(\bm{p} \,\|\, \bm{q}) &= \sum_{i=1}^{K} \bm{p}_{i}\,(\log \bm{p}_{i} - \log \bm{q}_{i}) \leq \sum_{i=1}^{K} \bm{p}_{i}\,|\log \bm{p}_{i} - \log \bm{q}_{i}| \\
&\leq \sum_{i=1}^{K} \bm{p}_{i}\,\big(C_{2}\,\|\bm{z} - \tilde{\bm{z}}\|_{2}\big) = C_{2}\,\|\bm{z} - \tilde{\bm{z}}\|_{2}.
\end{aligned} \tag{19}$$
*Step 3: Combining and taking expectation.* Combining equation [17](https://arxiv.org/html/2605.20730#A1.E17) and equation [19](https://arxiv.org/html/2605.20730#A1.E19) yields

$$D_{\mathrm{KL}}(\bm{p} \,\|\, \bm{q}) \leq C_{1} C_{2}\,\|\bm{h}_{\text{icl}} - \bm{h}_{\text{tv}}\|_{2}.$$

Taking expectation over $x \sim \mathcal{D}$, we have

$$\begin{aligned}
d_{\text{NTP}}(f) = \mathbb{E}_{x \sim \mathcal{D}}\big[D_{\mathrm{KL}}(\bm{p} \,\|\, \bm{q})\big]
&\leq C_{1} C_{2}\,\mathbb{E}_{x \sim \mathcal{D}}\big[\|\bm{h}_{\text{icl}} - \bm{h}_{\text{tv}}\|_{2}\big] \\
&\leq C_{1} C_{2}\,\sqrt{\mathbb{E}_{x \sim \mathcal{D}}\big[\|\bm{h}_{\text{icl}} - \bm{h}_{\text{tv}}\|_{2}^{2}\big]} \\
&= C_{1} C_{2}\,\sqrt{\mathcal{L}_{\text{MSE}}(f)},
\end{aligned}$$

where the second inequality follows from Jensen's inequality applied to the concave square-root function. This completes the proof. ∎
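The per-query bound $D_{\mathrm{KL}}(\bm{p}\,\|\,\bm{q}) \leq C_{1} C_{2}\,\|\bm{h}_{\text{icl}} - \bm{h}_{\text{tv}}\|_{2}$ derived above can be sanity-checked numerically. The following is a minimal sketch on random toy data, not the paper's experimental code. It makes two assumptions that are not taken from the paper: $C_{1}$ is set to the Frobenius norm of a random $\bm{W}_{\mathcal{C}}$ (which upper-bounds its spectral norm), and $C_{2}$ is set to $1 + \sqrt{K}$, a loose but valid $\ell_{2}$ Lipschitz constant for log-softmax. The function name `worst_bound_ratio` is hypothetical.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two strictly positive discrete distributions."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))


def matvec(W, h):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def worst_bound_ratio(K=4, d=8, trials=1000, seed=0):
    """Return the largest observed KL / (C1 * C2 * ||h_icl - h_tv||_2)."""
    rng = random.Random(seed)
    W_C = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(K)]
    # Assumption: Frobenius norm >= spectral norm, so it is a valid C1.
    C1 = math.sqrt(sum(w * w for row in W_C for w in row))
    # Assumption: log-softmax is (1 + sqrt(K))-Lipschitz in l2 (loose).
    C2 = 1.0 + math.sqrt(K)
    worst = 0.0
    for _ in range(trials):
        h_icl = [rng.gauss(0.0, 1.0) for _ in range(d)]
        h_tv = [a + rng.gauss(0.0, 0.5) for a in h_icl]
        p = softmax(matvec(W_C, h_icl))
        q = softmax(matvec(W_C, h_tv))
        gap = math.sqrt(sum((a - b) ** 2 for a, b in zip(h_icl, h_tv)))
        worst = max(worst, kl_divergence(p, q) / (C1 * C2 * gap))
    return worst
```

A returned ratio at or below 1 on every trial is consistent with the bound; because the constants here are deliberately loose, the observed ratio is typically far below 1.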

Table 6: Overview of the benchmark datasets used in our experiments, including their prompting templates, label sets, and licenses. {Sentence} and {Label} are placeholders for the input query $x$ and label $y$, respectively. All datasets are publicly available and used under their original licenses or standard research-use terms.

| Dataset | Prompt Template | Labels | License |
| --- | --- | --- | --- |
| AGNews | News: {Sentence} Type: {Label} | World / Sports / Business / Technology | Publicly available for research purposes |
| DBPedia | Input: {Sentence} Label: {Label} | company / school / artist / athlete / politics / transportation / building / nature / village / animal / plant / album / film / book | CC BY-SA 3.0 |
| HateSpeech18 | Text: {Sentence} Label: {Label} | neutral / hate | CC BY-SA 3.0 |
| MR | Review: {Sentence} Sentiment: {Label} | negative / positive | Publicly available for research purposes |
| SST-2 | Review: {Sentence} Sentiment: {Label} | negative / positive | CC BY 4.0 |
| SST-5 | Sentence: {Sentence} Sentiment: {Label} | terrible / negative / neutral / positive / great | CC BY 4.0 |
| SUBJ | Sentence: {Sentence} Label: {Label} | objective / subjective | Publicly available for research purposes |
| TREC | Question: {Sentence} Answer Type: {Label} | Abbreviation / Entity / Person / Location / Number | Publicly available for research purposes |

### A.3 Implementation Details on Evaluation

We provide detailed information about our evaluation setup. Following the experiments of Li et al. [[28](https://arxiv.org/html/2605.20730#bib.bib31)], we employ the same prompt templates and label sets, as summarized in Table [6](https://arxiv.org/html/2605.20730#A1.T6). However, EmoC [[7](https://arxiv.org/html/2605.20730#bib.bib84)] is excluded as it is no longer publicly available. For labels consisting of multiple tokens, we use the probability of the first token when computing equation [2](https://arxiv.org/html/2605.20730#S3.E2), following Hendel et al. [[16](https://arxiv.org/html/2605.20730#bib.bib29)], Todd et al. [[41](https://arxiv.org/html/2605.20730#bib.bib32)], and Li et al. [[28](https://arxiv.org/html/2605.20730#bib.bib31)]. All experiments were conducted using a single NVIDIA GPU (either RTX 6000 Ada Generation or GeForce RTX 4090).
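The first-token convention above can be illustrated with a short sketch of a label-restricted next-token distribution. This is an illustrative reading of the setup, not the paper's code: the function name `restricted_label_distribution`, the list-of-logits input, and the label-to-token-id mapping are all hypothetical, and the sketch assumes that no two labels share the same first token.

```python
import math


def restricted_label_distribution(logits, label_first_token_ids):
    """Softmax over the logits of each label's first token only.

    logits: sequence of next-token logits indexed by token id.
    label_first_token_ids: dict mapping label string -> id of its first token
        (hypothetical; in practice this would come from the tokenizer).
    """
    labels = list(label_first_token_ids)
    z = [logits[label_first_token_ids[label]] for label in labels]
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}
```

Renormalizing full-vocabulary probabilities over the label tokens gives the same result as taking a softmax over the selected logits, which is why the sketch works directly on logits.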

All experiments are conducted with $k=30$ demonstrations $Z$, sampled from the training dataset. Note that for classification tasks, the label set $\mathcal{C}$ contains $\lvert\mathcal{C}\rvert = K$ labels. For both task vector extraction and ICL inference, we ensure that each label is represented by an equal number of examples in $Z$. Specifically, we include the maximum number of examples per label such that all labels are equally represented while staying within the limit of $k$ demonstrations. For example, in the case of AGNews, where the number of classes $K$ is 4, we include 7 examples per label under the $k=30$ setting, resulting in a total of 28 demonstrations. This uniform label distribution is maintained across all experiments involving demonstrations.
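The balanced sampling rule above amounts to taking $\lfloor k / K \rfloor$ examples per label. A minimal sketch follows; the function name `sample_balanced_demos`, the `(text, label)` data layout, and the final shuffle of demonstration order are assumptions for illustration, since the ordering of demonstrations is not specified here.

```python
import random
from collections import defaultdict


def sample_balanced_demos(examples, k, seed=0):
    """Sample floor(k / K) demonstrations per label from (text, label) pairs."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    per_label = k // len(by_label)  # e.g. 30 // 4 = 7 for AGNews
    rng = random.Random(seed)
    demos = []
    for label in sorted(by_label):
        # random.Random.sample raises ValueError if a label has too few examples
        demos.extend(rng.sample(by_label[label], per_label))
    rng.shuffle(demos)  # assumption: demonstration order is randomized
    return demos
```

With four labels and $k=30$ this yields 28 demonstrations, matching the AGNews example in the text.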

### A.4 Implementation Details on Baselines

In this section, we provide the details of each baseline along with our implementation. For each baseline, we describe (1) the extraction phase, (2) the inference phase, and (3) the implementation details. Note that these baselines typically require additional *labeled* examples for layer or head selection (e.g., Task Vector, Function Vector, and State Vector use a labeled validation set of 32 examples), whereas our method only uses *unlabeled* queries.

- Task Vector [[16](https://arxiv.org/html/2605.20730#bib.bib29)]. In the extraction phase, task vectors are extracted from hidden states at the last token position. In the inference phase, the original hidden state of the model at a specific layer is replaced with the extracted vector (i.e., $\bm{h}_{\text{tv}} = \bm{v}$). Following the original study, we select the optimal layer independently for each task using a validation set with 32 *labeled* examples.
- Function Vector [[41](https://arxiv.org/html/2605.20730#bib.bib32)]. In the extraction phase, task vectors are extracted from the activations of attention heads that are selected based on their causal influence on increasing the probability of the correct label. In the inference phase, the task vector is added to the hidden state at a chosen injection layer. In our implementation, we follow the original paper and identify the top-10 attention heads based on their causal influence. We average their representations over 20 independent trials and select the optimal layer using a validation set with 32 labeled examples. For implementation details, we refer to the official repository at [https://github.com/ericwtodd/function_vectors](https://github.com/ericwtodd/function_vectors).
- State Vector [[24](https://arxiv.org/html/2605.20730#bib.bib41)]. In the extraction phase, task vectors are formed by collecting attention activations at the separator tokens of each demonstration. The activations are gathered across the initial layers. In the inference phase, these stored activations are added to the model activations. We employ the inner optimization strategy from the original paper, which averages activations extracted from multiple demonstration examples. The optimal depth of the initial layers is selected for each task via a validation set with 32 labeled examples. The method is implemented using the official code at [https://github.com/HITsz-TMG/ICL-State-Vector](https://github.com/HITsz-TMG/ICL-State-Vector).
- I2CL [[28](https://arxiv.org/html/2605.20730#bib.bib31)]. In the extraction phase, task vectors are extracted from the output activations of both attention and MLP modules at the last token position. These activations are averaged over each demonstration example. At inference time, a linear combination of these attention and MLP task vectors is additively injected into the residual stream at every layer. In our experiments, we use the initial coefficient values ($\lambda = 0.1$, $\beta = 1$) for the linear combination as suggested in the original paper. Code and detailed implementation instructions are available at [https://github.com/LzVv123456/I2CL](https://github.com/LzVv123456/I2CL).

### A.5 Implementation Details on Regression Experiments

We provide the detailed experimental setup for the regression experiments reported in Table [4](https://arxiv.org/html/2605.20730#S6.T4). Following prior work in the in-context learning literature [[13](https://arxiv.org/html/2605.20730#bib.bib85), [50](https://arxiv.org/html/2605.20730#bib.bib34)], we evaluate LTV on synthetic regression tasks. For each task instance, the model is given 30 demonstration pairs $(\bm{x}, \bm{y})$ and is asked to predict the continuous scalar output for a query input. We use a simple textual template where each demonstration is formatted as two lines: $x: [x_{1}, \ldots, x_{m}]$ followed by $y: y_{i}$, with demonstrations separated by blank lines.

Each input vector is sampled as $\bm{x} \sim \mathcal{N}(0, \bm{I}_{m})$ with $m=6$. We consider two regression settings: linear regression, where $\bm{y} = \bm{w}^{\top}\bm{x}$ for $\bm{w} \in \mathbb{R}^{d}$, and ReLU [[1](https://arxiv.org/html/2605.20730#bib.bib78)] regression, where $\bm{y} = \bm{\alpha}^{\top}\mathrm{ReLU}(\bm{V}\bm{x})$ for $\bm{V} \in \mathbb{R}^{r \times d}$, $\bm{\alpha} \in \mathbb{R}^{r}$, and $r=100$. For linear regression, the parameter is sampled as $\bm{w} \sim \mathcal{N}(0, \bm{I}_{d})$, following the isotropic Gaussian prior of Garg et al. [[13](https://arxiv.org/html/2605.20730#bib.bib85)]. For ReLU regression, we sample $\bm{V} \sim \mathcal{N}(0, \bm{I})$ and $\bm{\alpha} \sim \mathcal{N}(0, \frac{1}{r}\bm{I})$. The performance is measured by mean squared error (MSE).
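The data-generating process and prompt template described above can be sketched as follows. This is an illustrative reconstruction rather than the paper's code. The function names, the two-decimal number formatting, and the trailing `y:` line for the query are assumptions, and the sketch takes the parameter dimension $d$ to equal the input dimension $m$ so that $\bm{w}^{\top}\bm{x}$ is well defined.

```python
import random


def sample_linear_task(m, rng):
    """Linear task y = w^T x with w ~ N(0, I_m)."""
    w = [rng.gauss(0.0, 1.0) for _ in range(m)]
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))


def sample_relu_task(m, r, rng):
    """ReLU task y = alpha^T ReLU(V x), V entries ~ N(0, 1), alpha ~ N(0, I / r)."""
    V = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(r)]
    alpha = [rng.gauss(0.0, (1.0 / r) ** 0.5) for _ in range(r)]  # std = sqrt(1/r)

    def task(x):
        hidden = [max(0.0, sum(v * xi for v, xi in zip(row, x))) for row in V]
        return sum(a * h for a, h in zip(alpha, hidden))

    return task


def format_vector(x, digits):
    """Render a vector as '[x1, ..., xm]' with fixed decimals (assumed format)."""
    return "[" + ", ".join(f"{v:.{digits}f}" for v in x) + "]"


def build_prompt(task, m=6, n_demos=30, rng=None, digits=2):
    """Build demonstrations separated by blank lines, followed by the query."""
    rng = rng or random.Random(0)
    blocks = []
    for _ in range(n_demos):
        x = [rng.gauss(0.0, 1.0) for _ in range(m)]
        blocks.append(f"x: {format_vector(x, digits)}\ny: {task(x):.{digits}f}")
    query = [rng.gauss(0.0, 1.0) for _ in range(m)]
    blocks.append(f"x: {format_vector(query, digits)}\ny:")  # model completes y
    return "\n\n".join(blocks), query
```

For example, `build_prompt(sample_relu_task(6, 100, random.Random(1)))` returns a prompt with 30 demonstrations and one open query, together with the query vector needed to compute the ground-truth target for the MSE.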

Unlike classification, where each label is a single token from a discrete label set, the label $\bm{y}$ here is a real-valued scalar that the LLM must generate as a multi-token sequence (digits and decimal point). To extend LTV to this multi-token generation setting, we apply the linear mapping $\bm{W}^{\star}$ at every generation step. At each step, the model takes the query and the tokens generated so far as input, and computes the zero-shot hidden state $\bm{h}_{\text{zs}}$ of the last token. We then add the task vector $\bm{W}^{\star}\bm{h}_{\text{zs}}$ to it before feeding the result to the LM head. The sampled token is appended to the input for the next step. Note that $\bm{W}^{\star}$ is estimated once during extraction and reused across all generation steps, preserving the inference efficiency.
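The per-step procedure above can be sketched as a simple decoding loop. The sketch below is a toy illustration under stated assumptions, not the paper's implementation: `hidden_state_fn` and `lm_head_fn` are hypothetical stand-ins for the model's final-layer hidden state and LM head, greedy decoding is used in place of whatever sampling strategy the experiments use, and stopping at an end-of-sequence id is an added convention.

```python
def matvec(W, h):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def generate_with_ltv(query_ids, hidden_state_fn, lm_head_fn, W_star,
                      eos_id, max_new_tokens=16):
    """Apply the same linear map W_star at every generation step.

    hidden_state_fn(ids) -> zero-shot hidden state of the last token (hypothetical).
    lm_head_fn(h)        -> list of vocabulary logits (hypothetical).
    """
    ids = list(query_ids)
    generated = []
    for _ in range(max_new_tokens):
        h_zs = hidden_state_fn(ids)               # zero-shot hidden state
        task_vector = matvec(W_star, h_zs)        # W* h_zs
        h_tv = [a + b for a, b in zip(h_zs, task_vector)]
        logits = lm_head_fn(h_tv)
        # Assumption: greedy decoding; the decoding rule is not specified above.
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == eos_id:
            break
        generated.append(next_id)
        ids.append(next_id)                       # feed back for the next step
    return generated
```

Because `W_star` is fixed, each step costs one extra matrix-vector product on top of the zero-shot forward pass, which is the source of the efficiency noted in the text.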

### A.6 Implementation Details on Cross-Model Transfer Experiments

We provide the detailed experimental setup and method for the cross-model transfer experiments reported in Table [6.1](https://arxiv.org/html/2605.20730#S6.SS1.SSS0.Px3). The proposed LTV method in Sec. [5](https://arxiv.org/html/2605.20730#S5) is defined within a single model. In practice, however, when the capacity of the model is insufficient, ICL alone may not extract enough task information from the demonstrations, which limits the quality of the task vector.

We thus consider transferring task vectors extracted from a larger model, which captures richer ICL representations from the same demonstrations, to a smaller model, in order to improve its performance without requiring additional demonstrations or a larger inference cost. This transfer is non-trivial because the two models have different hidden state dimensions and different LM heads, so the linear mapping $\bm{W}^{\star}$ in Sec. [5](https://arxiv.org/html/2605.20730#S5) cannot be directly reused.

##### Method\.

We extend LTV to the cross-model setting with a single modification. We replace the hidden-space regression in equation [15](https://arxiv.org/html/2605.20730#S5.E15) with a logit-space regression. The intuition is that, while hidden state spaces differ across models, two models that share the same tokenizer share the vocabulary as a common coordinate system, making the logit space a natural target for alignment. To be specific, let $\bm{z}_{\mathrm{icl}}^{L} \coloneqq \bm{W}_{\mathrm{lm}}^{L}\bm{h}_{\mathrm{icl}}^{L}$ denote the logit produced by the large model under ICL, and $\bm{z}_{\mathrm{zs}}^{S} \coloneqq \bm{W}_{\mathrm{lm}}^{S}\bm{h}_{\mathrm{zs}}^{S}$ the logit produced by the small model under zero-shot inference, where the superscripts $L$ and $S$ denote quantities from the large and small model, respectively.

We solve a ridge regression analogous to equation [15](https://arxiv.org/html/2605.20730#S5.E15), mapping $\bm{h}_{\mathrm{zs}}^{S}$ to the logit-space difference $\bm{z}_{\mathrm{icl}}^{L} - \bm{z}_{\mathrm{zs}}^{S}$, which yields a closed-form solution $\tilde{\bm{W}}^{\star} \in \mathbb{R}^{N_{\mathcal{U}} \times d_{S}}$, where $d_{S}$ is the hidden dimension of the small model and $N_{\mathcal{U}}$ is the vocabulary size. At inference, the corrected logit is $\bm{z}_{\mathrm{zs}}^{S}(\bm{x}_{\mathrm{test}}) + \tilde{\bm{W}}^{\star}\bm{h}_{\mathrm{zs}}^{S}(\bm{x}_{\mathrm{test}})$. When the large and small models are identical, this reduces to the original formulation in equation [16](https://arxiv.org/html/2605.20730#S5.E16) up to the linear LM head transformation, ensuring that our extension preserves the design principle of LTV.
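The logit-space regression described above can be sketched with the standard ridge closed form $\tilde{\bm{W}}^{\top} = (\bm{H}^{\top}\bm{H} + \lambda \bm{I})^{-1}\bm{H}^{\top}\bm{Y}$, where the rows of $\bm{H}$ are small-model zero-shot hidden states and the rows of $\bm{Y}$ are the logit differences. This is a pure-Python toy sketch, not the paper's code: the exact objective of equation 15 (for instance, how the penalty is scaled or whether an intercept is used) is not reproduced here, so the unscaled Frobenius penalty is an assumption, and all function names are hypothetical.

```python
def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        pv = M[col][col]
        M[col] = [v / pv for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]


def fit_logit_ridge(H_small, Z_large_icl, Z_small_zs, lam):
    """Return W (vocab x d_S) mapping h_zs^S to z_icl^L - z_zs^S.

    Assumed objective: sum_i ||W h_i - y_i||^2 + lam * ||W||_F^2.
    """
    n, d = len(H_small), len(H_small[0])
    vocab = len(Z_large_icl[0])
    Y = [[zl - zs for zl, zs in zip(row_l, row_s)]
         for row_l, row_s in zip(Z_large_icl, Z_small_zs)]
    # A = H^T H + lam * I  (d x d),  B = H^T Y  (d x vocab)
    A = [[sum(H_small[i][a] * H_small[i][b] for i in range(n))
          + (lam if a == b else 0.0) for b in range(d)] for a in range(d)]
    B = [[sum(H_small[i][a] * Y[i][c] for i in range(n)) for c in range(vocab)]
         for a in range(d)]
    X = solve(A, B)  # X = W^T
    return [[X[a][c] for a in range(d)] for c in range(vocab)]


def corrected_logits(W, h_zs, z_zs):
    """Small-model zero-shot logits plus the transferred correction W h_zs."""
    return [z + sum(w * h for w, h in zip(row, h_zs)) for row, z in zip(W, z_zs)]
```

At realistic scale the $d_{S} \times d_{S}$ system would be solved with a numerical linear algebra library rather than hand-written elimination; the small solver here only keeps the sketch self-contained.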

##### Setup\.

We use Qwen2.5-72B as the large model and Qwen2.5-7B as the small model, which share the same tokenizer. We follow the eight classification benchmarks and prompt templates used in the main experiments in Sec. [6](https://arxiv.org/html/2605.20730#S6). To target the regime where the small model cannot extract sufficient task information from demonstrations alone, we use $k=10$ demonstrations, compared to $k=30$ in the main experiments. For DBPedia, we use $k=14$ since it has 14 classes. We use $N=256$ unlabeled training queries and $\lambda=5.0$ for the ridge regression, following the default values in the main experiments. We report the average accuracy over five independent runs.

### A\.7Additional Experiments

Table 7: Complete results for Table [1](https://arxiv.org/html/2605.20730#S6.T1) in the main paper. We report the classification accuracy (%) of LTV and four baseline task vector methods on eight benchmarks across all five LLMs evaluated in this work. The accuracy of Zero-shot and ICL inference is provided as reference inference modes. LTV consistently achieves the highest average accuracy (Avg.) across all models.

| Models | Methods | AGNews | DBPedia | HateSpeech18 | MR | SST-2 | SST-5 | Subj | TREC | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen-2.5-7B \[[38](https://arxiv.org/html/2605.20730#bib.bib40)\] | Zero-shot | 73.80 | 74.80 | 58.40 | 68.00 | 54.80 | 20.20 | 48.80 | 66.40 | 58.15 |
| | ICL | 85.40 (±1.6) | 97.32 (±0.3) | 81.76 (±1.3) | 93.44 (±0.5) | 94.40 (±0.3) | 48.56 (±2.2) | 89.68 (±1.9) | 87.32 (±2.6) | 84.74 |
| | Function Vector \[[41](https://arxiv.org/html/2605.20730#bib.bib32)\] | 73.24 (±0.4) | 75.56 (±0.5) | 58.64 (±0.5) | 58.96 (±1.2) | 58.48 (±1.8) | 20.20 (±0.0) | 58.32 (±3.5) | 68.08 (±0.7) | 58.94 |
| | Task Vector \[[16](https://arxiv.org/html/2605.20730#bib.bib29)\] | 74.76 (±2.1) | 78.52 (±1.8) | 63.92 (±3.3) | 87.28 (±2.9) | 93.76 (±1.2) | 23.52 (±4.7) | 54.96 (±6.9) | 74.12 (±3.2) | 68.86 |
| | State Vector \[[24](https://arxiv.org/html/2605.20730#bib.bib41)\] | 78.52 (±0.3) | 76.88 (±0.6) | 57.84 (±1.3) | 85.84 (±2.5) | 86.64 (±3.6) | 30.60 (±0.2) | 70.12 (±2.7) | 71.64 (±6.8) | 69.76 |
| | I2CL \[[28](https://arxiv.org/html/2605.20730#bib.bib31)\] | 75.60 (±2.3) | 75.68 (±0.4) | 62.80 (±0.7) | 79.24 (±0.4) | 87.36 (±0.3) | 22.96 (±0.3) | 42.72 (±0.3) | 56.68 (±1.8) | 62.88 |
| | LTV (Ours) | 77.32 (±6.7) | 90.68 (±2.2) | 75.24 (±1.2) | 90.56 (±1.3) | 91.76 (±1.7) | 31.32 (±4.2) | 80.68 (±5.6) | 83.40 (±6.2) | 77.62 |
| Qwen-3-8B \[[49](https://arxiv.org/html/2605.20730#bib.bib52)\] | Zero-shot | 57.40 | 74.60 | 71.60 | 92.40 | 92.20 | 34.80 | 66.40 | 68.20 | 69.70 |
| | ICL | 86.56 (±2.3) | 97.31 (±0.6) | 83.44 (±1.0) | 93.28 (±0.5) | 94.48 (±0.5) | 53.08 (±2.3) | 93.36 (±0.7) | 85.24 (±4.8) | 85.84 |
| | Function Vector \[[41](https://arxiv.org/html/2605.20730#bib.bib32)\] | 61.68 (±0.8) | 77.74 (±0.3) | 71.56 (±0.4) | 90.24 (±1.0) | 91.36 (±0.2) | 34.88 (±0.1) | 67.20 (±2.2) | 75.56 (±0.7) | 71.28 |
| | Task Vector \[[16](https://arxiv.org/html/2605.20730#bib.bib29)\] | 79.24 (±7.0) | 81.49 (±1.4) | 74.20 (±0.9) | 92.36 (±0.2) | 93.16 (±0.6) | 34.08 (±0.4) | 76.00 (±2.5) | 73.56 (±7.9) | 75.51 |
| | State Vector \[[24](https://arxiv.org/html/2605.20730#bib.bib41)\] | 81.24 (±4.2) | 84.40 (±0.9) | 69.92 (±2.0) | 92.44 (±0.1) | 92.24 (±0.4) | 33.92 (±1.1) | 75.56 (±6.8) | 78.32 (±2.6) | 76.01 |
| | I2CL \[[28](https://arxiv.org/html/2605.20730#bib.bib31)\] | 62.92 (±0.2) | 74.51 (±0.3) | 70.12 (±0.6) | 92.40 (±0.1) | 93.24 (±0.2) | 34.92 (±0.1) | 59.76 (±0.3) | 78.36 (±0.9) | 70.78 |
| | LTV (Ours) | 77.80 (±2.1) | 94.20 (±1.7) | 76.04 (±3.4) | 86.72 (±6.3) | 90.40 (±2.2) | 39.40 (±7.1) | 90.00 (±2.2) | 81.96 (±4.6) | 79.57 |
| LLaMA-2-7B \[[42](https://arxiv.org/html/2605.20730#bib.bib28)\] | Zero-shot | 72.40 | 73.60 | 54.00 | 73.00 | 80.40 | 28.40 | 51.60 | 50.00 | 60.43 |
| | ICL | 84.48 (±5.1) | 94.44 (±2.7) | 64.60 (±9.0) | 93.72 (±0.6) | 93.52 (±1.2) | 42.88 (±3.3) | 54.48 (±6.5) | 79.36 (±4.0) | 75.94 |
| | Function Vector \[[41](https://arxiv.org/html/2605.20730#bib.bib32)\] | 71.64 (±0.7) | 73.88 (±1.2) | 55.80 (±0.6) | 74.72 (±0.9) | 78.60 (±0.1) | 27.84 (±0.8) | 50.04 (±0.2) | 59.28 (±4.2) | 61.48 |
| | Task Vector \[[16](https://arxiv.org/html/2605.20730#bib.bib29)\] | 76.68 (±2.6) | 78.68 (±0.6) | 71.28 (±2.8) | 78.64 (±6.8) | 81.60 (±5.6) | 32.24 (±3.6) | 48.72 (±2.6) | 58.84 (±6.8) | 65.84 |
| | State Vector \[[24](https://arxiv.org/html/2605.20730#bib.bib41)\] | 74.36 (±8.4) | 90.36 (±1.0) | 64.12 (±3.7) | 88.48 (±3.8) | 82.56 (±7.2) | 33.68 (±8.4) | 50.20 (±3.7) | 62.20 (±7.5) | 68.25 |
| | I2CL \[[28](https://arxiv.org/html/2605.20730#bib.bib31)\] | 72.96 (±1.1) | 76.00 (±0.1) | 51.48 (±0.4) | 74.16 (±1.0) | 80.16 (±0.1) | 34.72 (±0.6) | 51.16 (±0.1) | 60.24 (±0.5) | 62.61 |
| | LTV (Ours) | 83.96 (±4.4) | 88.84 (±1.6) | 55.72 (±5.2) | 91.20 (±2.5) | 90.12 (±0.9) | 38.68 (±2.0) | 50.32 (±0.3) | 70.68 (±5.4) | 71.19 |
| LLaMA-2-13B \[[42](https://arxiv.org/html/2605.20730#bib.bib28)\] | Zero-shot | 73.80 | 76.20 | 55.20 | 61.20 | 66.60 | 20.80 | 49.80 | 69.60 | 59.15 |
| | ICL | 88.04 (±1.7) | 96.88 (±1.1) | 77.40 (±2.2) | 94.44 (±0.5) | 94.80 (±0.7) | 46.52 (±2.7) | 82.28 (±7.1) | 83.36 (±5.2) | 82.97 |
| | Function Vector \[[41](https://arxiv.org/html/2605.20730#bib.bib32)\] | 74.00 (±1.2) | 76.40 (±7.5) | 54.08 (±0.4) | 61.84 (±3.4) | 72.44 (±6.0) | 21.40 (±1.2) | 50.00 (±0.0) | 69.96 (±1.1) | 60.02 |
| | Task Vector \[[16](https://arxiv.org/html/2605.20730#bib.bib29)\] | 79.84 (±2.9) | 80.72 (±1.9) | 71.48 (±6.7) | 83.16 (±0.8) | 82.80 (±4.2) | 33.64 (±2.1) | 49.72 (±0.1) | 73.84 (±5.3) | 69.40 |
| | State Vector \[[24](https://arxiv.org/html/2605.20730#bib.bib41)\] | 84.64 (±4.0) | 89.48 (±4.9) | 58.80 (±8.8) | 89.20 (±2.8) | 87.20 (±3.6) | 36.08 (±3.0) | 49.40 (±0.6) | 66.96 (±1.0) | 70.22 |
| | I2CL \[[28](https://arxiv.org/html/2605.20730#bib.bib31)\] | 78.88 (±0.3) | 79.00 (±0.5) | 54.48 (±0.1) | 60.68 (±0.3) | 64.80 (±0.4) | 25.96 (±0.1) | 50.00 (±0.0) | 69.12 (±0.4) | 60.37 |
| | LTV (Ours) | 86.68 (±1.7) | 93.20 (±0.5) | 72.08 (±2.2) | 90.20 (±2.0) | 88.96 (±2.1) | 41.88 (±2.2) | 78.76 (±2.7) | 83.92 (±6.3) | 79.46 |
| LLaMA-3.1-8B \[[11](https://arxiv.org/html/2605.20730#bib.bib39)\] | Zero-shot | 75.00 | 69.00 | 60.60 | 82.20 | 86.80 | 25.60 | 59.20 | 49.40 | 63.48 |
| | ICL | 87.16 (±1.2) | 97.68 (±0.8) | 74.32 (±8.1) | 94.52 (±0.4) | 94.20 (±1.1) | 48.32 (±1.1) | 85.96 (±5.3) | 79.08 (±7.3) | 82.66 |
| | Function Vector \[[41](https://arxiv.org/html/2605.20730#bib.bib32)\] | 76.16 (±0.1) | 69.44 (±1.0) | 63.00 (±0.2) | 83.24 (±0.3) | 87.32 (±0.6) | 26.20 (±1.3) | 59.52 (±0.2) | 69.12 (±0.6) | 66.75 |
| | Task Vector \[[16](https://arxiv.org/html/2605.20730#bib.bib29)\] | 81.36 (±3.7) | 83.24 (±1.5) | 67.64 (±2.0) | 83.48 (±3.5) | 87.88 (±0.3) | 34.76 (±1.9) | 61.12 (±1.4) | 66.48 (±1.0) | 70.75 |
| | State Vector \[[24](https://arxiv.org/html/2605.20730#bib.bib41)\] | 80.28 (±4.3) | 80.80 (±1.5) | 65.40 (±1.3) | 85.96 (±4.9) | 84.12 (±0.6) | 36.60 (±1.6) | 62.52 (±3.5) | 67.28 (±1.9) | 70.37 |
| | I2CL \[[28](https://arxiv.org/html/2605.20730#bib.bib31)\] | 76.76 (±0.4) | 72.56 (±0.4) | 62.24 (±0.5) | 85.24 (±0.3) | 90.80 (±0.2) | 32.48 (±0.4) | 62.28 (±0.1) | 49.00 (±0.5) | 66.42 |
| | LTV (Ours) | 82.84 (±2.5) | 93.36 (±1.4) | 70.44 (±6.0) | 88.68 (±1.0) | 90.08 (±1.2) | 38.20 (±3.4) | 67.16 (±11.4) | 72.88 (±7.2) | 75.46 |

##### Extended results of Table[1](https://arxiv.org/html/2605.20730#S6.T1)\.

Table [7](https://arxiv.org/html/2605.20730#A1.T7) provides the complete results across all five LLMs evaluated in this work. Consistent with the trend reported in the main body, LTV achieves the highest average accuracy among the task vector methods on all models, demonstrating that its effectiveness generalizes across different model families and scales.

##### Design choices for the task vector\.

In Sec. [5.1](https://arxiv.org/html/2605.20730#S5.SS1) of the main body, we propose to solve the proxy problem in equation [14](https://arxiv.org/html/2605.20730#S5.E14) by minimizing $\mathcal{L}_{\text{MSE}}$ in equation [11](https://arxiv.org/html/2605.20730#S5.E11). We interpret the proxy problem as finding a task vector $\bm{v} = f(x, Z)$ that compensates for the effect of demonstrations in the hidden state. Specifically, the goal of using task vectors is to approximate the hidden state difference $\bm{h}_{\text{icl}} - \bm{h}_{\text{zs}}$ between ICL and zero-shot inference:

$$\min_{f} \mathbb{E}_{x \sim \mathcal{D}} \Big[ \big\| \bm{h}_{\text{icl}}(x, Z) - \bm{h}_{\text{zs}}(x) - f(x, Z) \big\|_{2}^{2} \Big].$$

In the main paper, we consider $f$ as a mapping from $\bm{h}_{\text{zs}}$ to the target $\bm{h}_{\text{icl}} - \bm{h}_{\text{zs}}$ and adopt a linear mapping $f(x, Z) = \bm{W} \bm{h}_{\text{zs}}(x)$ for its closed-form solution:

$$\min_{\bm{W}} \mathbb{E}_{x \sim \mathcal{D}} \Big[ \big\| \bm{h}_{\text{icl}}(x, Z) - \bm{h}_{\text{zs}}(x) - \bm{W} \bm{h}_{\text{zs}}(x) \big\|_{2}^{2} \Big].$$

Here, we compare this choice against alternative designs and validate its effectiveness.
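For reference, adding an $\ell_2$ penalty with weight $\lambda$ to the empirical version of the linear objective gives the standard ridge-regression closed form. The derivation below is a sketch under that assumption, using $N$ unlabeled queries $x_i$ and the shorthand $\bm{\delta}_i$ introduced here for illustration; the exact normalization used in equation 15 of the main paper may differ.

```latex
\begin{aligned}
\bm{\delta}_i &\coloneqq \bm{h}_{\text{icl}}(x_i, Z) - \bm{h}_{\text{zs}}(x_i), \\
\bm{W}^{\star} &= \arg\min_{\bm{W}} \sum_{i=1}^{N} \big\| \bm{\delta}_i - \bm{W} \bm{h}_{\text{zs}}(x_i) \big\|_{2}^{2} + \lambda \| \bm{W} \|_{F}^{2} \\
&= \Big( \sum_{i=1}^{N} \bm{\delta}_i \, \bm{h}_{\text{zs}}(x_i)^{\top} \Big) \Big( \sum_{i=1}^{N} \bm{h}_{\text{zs}}(x_i) \, \bm{h}_{\text{zs}}(x_i)^{\top} + \lambda \bm{I} \Big)^{-1}.
\end{aligned}
```

The second equality follows from setting the gradient with respect to $\bm{W}$ to zero; the inverse exists for any $\lambda > 0$ because the regularized Gram matrix is positive definite.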

Given the strong performance of our proposed LTV method as shown in Sec. [6.2](https://arxiv.org/html/2605.20730#S6.SS2), the choice of employing a linear mapping naturally raises two questions: (1) Is it necessary to model $f$ as a mapping, the output of which is dependent on the input $\bm{h}_{\text{zs}}(x)$, or does modeling $f$ as a simple constant vector suffice? (2) How effective is our linear mapping compared to a more expressive alternative trained via gradient descent? To answer these two questions, we compare our linear mapping approach against a constant mapping baseline and a more expressive alternative (2-layer MLP-based mapping):

- Constant Mapping: Modeling the output of $f$ as a fixed vector $\bm{c} \in \mathbb{R}^{d}$ independent of the query $x$, i.e., $f(x, Z) = \bm{c}$. The optimal solution is given by the empirical mean of $\bm{h}_{\text{icl}} - \bm{h}_{\text{zs}}$, represented as $\bm{c}^{*} = \tfrac{1}{N} \sum_{i=1}^{N} \big( \bm{h}_{\text{icl}}(x_i, Z) - \bm{h}_{\text{zs}}(x_i) \big)$, where $N$ unlabeled train queries $\{x_i\}_{i=1}^{N}$ are used.
- Linear Mapping (Ours): Mapping from $\bm{h}_{\text{zs}}$ to the target $\bm{h}_{\text{icl}} - \bm{h}_{\text{zs}}$ using a linear transformation, i.e., $f(x, Z) = \bm{W} \bm{h}_{\text{zs}}(x)$, where $\bm{W} \in \mathbb{R}^{d \times d}$. This has a closed-form solution when we formulate it as a ridge regression problem; see Section [5.2](https://arxiv.org/html/2605.20730#S5.SS2).
- 2-layer MLP-based Mapping: Mapping from $\bm{h}_{\text{zs}}$ to the target $\bm{h}_{\text{icl}} - \bm{h}_{\text{zs}}$ using a 2-layer ReLU \[[1](https://arxiv.org/html/2605.20730#bib.bib78)\] network: $f(x, Z) = \bm{W}_{2} \, \sigma_{\text{ReLU}}(\bm{W}_{1} \bm{h}_{\text{zs}}(x))$, where $\bm{W}_{1}, \bm{W}_{2} \in \mathbb{R}^{d \times d}$ are learnable parameters and $\sigma_{\text{ReLU}}$ denotes the ReLU activation function. This is more expressive than the linear mapping, but requires iterative optimization. We train this network by minimizing the $\mathcal{L}_{\text{MSE}}$ loss using AdamW \[[30](https://arxiv.org/html/2605.20730#bib.bib83)\] with learning rate $10^{-3}$, cosine scheduler with warmup ratio 0.1, batch size 8, and 20 epochs over $N=256$ unlabeled train queries with $k=30$ demonstrations $Z$.
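The two closed-form designs in the list above (constant and linear) can be compared side by side on synthetic data. The sketch below is illustrative only: the helper names, the random arrays standing in for hidden states, the query-dependent shift built into the toy data, and the unnormalized ridge penalty are assumptions made here, not details taken from the paper.

```python
import numpy as np


def fit_constant(h_zs, h_icl):
    """Constant mapping: c* is the empirical mean of h_icl - h_zs."""
    return (h_icl - h_zs).mean(axis=0)


def fit_linear_ridge(h_zs, h_icl, lam=5.0):
    """Linear mapping: ridge solution W* (d x d) with f(x, Z) = W* h_zs(x)."""
    delta = h_icl - h_zs
    d = h_zs.shape[1]
    gram = h_zs.T @ h_zs + lam * np.eye(d)
    # Solve gram @ A = h_zs^T delta; W* is the transpose of A.
    return np.linalg.solve(gram, h_zs.T @ delta).T


def mse(pred, target):
    """Squared L2 norm of the residual, averaged over queries."""
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))


# Synthetic stand-in for hidden states (assumption: the ICL shift depends on the query).
rng = np.random.default_rng(0)
N, d = 256, 8
h_zs = rng.normal(size=(N, d))
A = rng.normal(size=(d, d))
h_icl = h_zs + h_zs @ A.T + 0.1 * rng.normal(size=(N, d))

delta = h_icl - h_zs
c_star = fit_constant(h_zs, h_icl)
W_star = fit_linear_ridge(h_zs, h_icl)

mse_const = mse(np.broadcast_to(c_star, delta.shape), delta)
mse_linear = mse(h_zs @ W_star.T, delta)
print(f"constant: {mse_const:.3f}  linear: {mse_linear:.3f}")
```

On this toy data the shift varies with the query by construction, so a single vector cannot track it while the linear fit can. The example illustrates the mechanism only and is not meant to reproduce the magnitudes reported for real hidden states.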

Table 8: Comparison of design choices for the task vector $f(x, Z)$ that maps $\bm{h}_{\text{zs}}(x)$ to the estimated hidden state difference $\bm{h}_{\text{icl}} - \bm{h}_{\text{zs}}$. We compare three ways of designing such a mapping: the constant mapping uses a fixed vector $f(x, Z) = \bm{c}$, the linear mapping uses a linear transformation $f(x, Z) = \bm{W} \bm{h}_{\text{zs}}(x)$, and the 2-layer MLP-based mapping uses a ReLU network $f(x, Z) = \bm{W}_{2} \sigma_{\text{ReLU}}(\bm{W}_{1} \bm{h}_{\text{zs}}(x))$. Note that the linear mapping is used in our proposed LTV method in Sec. [5](https://arxiv.org/html/2605.20730#S5). We report the accuracy averaged over the 8 classification benchmarks used in Sec. [6](https://arxiv.org/html/2605.20730#S6), our proposed $d_{\text{NTP}}$ metric, and $\mathcal{L}_{\text{MSE}}$. Experiments are conducted on LLaMA-3.1-8B \[[11](https://arxiv.org/html/2605.20730#bib.bib39)\].

| Design Choices | $f(x, Z)$ | Avg. Acc. ↑ | $d_{\text{NTP}}$ ↓ | $\mathcal{L}_{\text{MSE}}$ ↓ |
|---|---|---|---|---|
| Constant Mapping | $\bm{c}$ | 55.72 | 0.694 | 2.84 |
| Linear Mapping (Ours) | $\bm{W} \bm{h}_{\text{zs}}(x)$ | 75.21 | 0.147 | 1.22 |
| 2-layer MLP based Mapping | $\bm{W}_{2} \sigma_{\text{ReLU}}(\bm{W}_{1} \bm{h}_{\text{zs}}(x))$ | 77.74 | 0.132 | 0.64 |

![Refer to caption](https://arxiv.org/html/2605.20730v1/x6.png)
Figure 6: Extraction time cost versus downstream accuracy of LTV and MLP-based mappings with varying depth. All methods are tested on LLaMA-3.1-8B, and accuracy is averaged over eight classification benchmarks used in the main paper in Sec. [6](https://arxiv.org/html/2605.20730#S6).

As shown in Table [8](https://arxiv.org/html/2605.20730#A1.T8), the constant mapping achieves only 55.72% accuracy with $\mathcal{L}_{\text{MSE}} = 2.84$, performing significantly worse than our proposed linear mapping (75.21% accuracy, $\mathcal{L}_{\text{MSE}} = 1.22$) by a margin of 19.49 percentage points in accuracy and 1.62 in MSE. This confirms that a query-dependent mapping is essential for approximating the effect of demonstrations, as a fixed hidden state vector cannot capture how this effect varies across queries.

On the other hand, the 2-layer MLP-based mapping further reduces $\mathcal{L}_{\text{MSE}}$ to 0.64 and improves accuracy to 77.74%, indicating that more expressive mappings can yield additional gains within our framework. This observation suggests that our optimization formulation in equation [12](https://arxiv.org/html/2605.20730#S5.E12) naturally extends beyond the linear case. In this paper, we focus on the linear mapping as the simplest instantiation that admits an efficient closed-form solution and preserves the training-free nature of ICL, and leave non-linear extensions that require iterative optimization for future work. Below, we provide a preliminary cost-benefit analysis of such non-linear mappings when training is allowed, to characterize the trade-off between expressiveness and extraction cost.

##### Further discussion on scaling of the non\-linear mappings\.

To further explore how non-linear mappings scale with model capacity, we train deeper MLP-based mappings (4, 8, and 16 layers), each with skip connections every two layers to preserve the residual stream. Fig. [6](https://arxiv.org/html/2605.20730#A1.F6) compares LTV and these MLP-based variants in terms of both extraction (training for MLP) time cost and classification accuracy. While deeper MLPs achieve higher absolute accuracy (up to 79.36% with 16 layers), the gain comes with substantially increased extraction cost: a 16-layer MLP requires roughly $5.5\times$ the extraction time of LTV for a 4.15% absolute accuracy improvement.

![Refer to caption](https://arxiv.org/html/2605.20730v1/x7.png)
Figure 7: Extraction time cost versus downstream accuracy of LTV and PEFT-based methods (prompt tuning \[[23](https://arxiv.org/html/2605.20730#bib.bib89)\] and LoRA \[[18](https://arxiv.org/html/2605.20730#bib.bib90)\]), evaluated under three numbers of demonstrations ($k=10, 30, 50$). All methods are tested on LLaMA-3.1-8B, and accuracy is averaged over eight classification benchmarks used in the main paper in Sec. [6](https://arxiv.org/html/2605.20730#S6).

Table 9: Extended results of Table [1](https://arxiv.org/html/2605.20730#S6.T1) on LLaMA-3.1-8B under the number of demonstrations $k=10$ setting. The experimental setup is otherwise identical to Table [1](https://arxiv.org/html/2605.20730#S6.T1).

| Methods | AGNews | DBPedia | HateSpeech18 | MR | SST-2 | SST-5 | Subj | TREC | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | 75.00 | 69.00 | 60.60 | 82.20 | 86.80 | 25.60 | 59.20 | 49.40 | 63.48 |
| ICL | 87.52 | 96.44 | 73.00 | 94.64 | 94.32 | 46.20 | 81.84 | 80.20 | 81.77 |
| Function Vector \[[41](https://arxiv.org/html/2605.20730#bib.bib32)\] | 75.72 | 69.80 | 63.80 | 82.04 | 86.80 | 25.56 | 59.56 | 67.40 | 66.34 |
| Task Vector \[[16](https://arxiv.org/html/2605.20730#bib.bib29)\] | 80.16 | 81.84 | 68.68 | 84.24 | 87.08 | 31.68 | 59.64 | 66.32 | 69.95 |
| State Vector \[[24](https://arxiv.org/html/2605.20730#bib.bib41)\] | 84.28 | 80.80 | 67.24 | 90.80 | 79.16 | 33.48 | 67.20 | 60.64 | 70.45 |
| I2CL \[[28](https://arxiv.org/html/2605.20730#bib.bib31)\] | 76.64 | 72.44 | 62.00 | 85.20 | 90.64 | 32.12 | 61.52 | 48.48 | 66.13 |
| LTV (Ours) | 82.20 | 90.04 | 64.44 | 89.44 | 88.92 | 41.60 | 69.60 | 72.92 | 74.89 |

Table 10: Extended results of Table [1](https://arxiv.org/html/2605.20730#S6.T1) on LLaMA-3.1-8B under the number of demonstrations $k=50$ setting. The experimental setup is otherwise identical to Table [1](https://arxiv.org/html/2605.20730#S6.T1).

| Methods | AGNews | DBPedia | HateSpeech18 | MR | SST-2 | SST-5 | Subj | TREC | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | 75.00 | 69.00 | 60.60 | 82.20 | 86.80 | 25.60 | 59.20 | 49.40 | 63.48 |
| ICL | 87.84 | 97.24 | 78.92 | 94.60 | 94.52 | 50.28 | 90.68 | 85.68 | 84.97 |
| Function Vector \[[41](https://arxiv.org/html/2605.20730#bib.bib32)\] | 76.56 | 69.92 | 63.56 | 83.00 | 87.48 | 26.16 | 59.48 | 67.08 | 66.66 |
| Task Vector \[[16](https://arxiv.org/html/2605.20730#bib.bib29)\] | 77.60 | 82.36 | 65.60 | 81.72 | 88.24 | 32.24 | 59.44 | 67.60 | 69.35 |
| State Vector \[[24](https://arxiv.org/html/2605.20730#bib.bib41)\] | 83.56 | 81.12 | 65.60 | 84.28 | 79.04 | 33.64 | 61.48 | 67.72 | 69.56 |
| I2CL \[[28](https://arxiv.org/html/2605.20730#bib.bib31)\] | 77.16 | 72.84 | 62.32 | 85.52 | 90.64 | 32.44 | 61.96 | 49.40 | 66.53 |
| LTV (Ours) | 83.92 | 92.44 | 74.52 | 87.92 | 89.80 | 41.80 | 69.88 | 76.72 | 77.12 |

##### Further discussion on comparison with PEFT\-based methods\.

We further examine alternative ways of minimizing our proposed surrogate $\mathcal{L}_{\text{MSE}}$ by employing parameter-efficient fine-tuning (PEFT) methods, instead of the closed-form linear mapping used in LTV. Specifically, we consider Prompt Tuning \[[23](https://arxiv.org/html/2605.20730#bib.bib89)\] and LoRA \[[18](https://arxiv.org/html/2605.20730#bib.bib90)\], two widely used PEFT approaches that introduce a small number of trainable parameters while keeping the backbone frozen, and train them to minimize the same $\mathcal{L}_{\text{MSE}}$ objective in equation [11](https://arxiv.org/html/2605.20730#S5.E11). We evaluate all methods under three demonstration budgets ($k=10, 30, 50$), and report the extraction (training for PEFT) time cost and the classification accuracy in Fig. [7](https://arxiv.org/html/2605.20730#A1.F7). In the few-shot regime ($k=10$), LTV outperforms LoRA in both accuracy and extraction efficiency. When more demonstrations are available ($k=30, 50$), LoRA achieves higher accuracy than LTV, but at roughly $3$ times the extraction cost. These results indicate that our method provides a competitive edge compared to PEFT-based methods, particularly in the low-data regime.

##### Effect of the number of demonstrations\.

To verify the robustness of LTV to the number of demonstrations $k$, we also report results under $k=10$ (Table [9](https://arxiv.org/html/2605.20730#A1.T9)) and $k=50$ (Table [10](https://arxiv.org/html/2605.20730#A1.T10)) on LLaMA-3.1-8B. In both settings, LTV achieves the highest average accuracy among the task vector methods, demonstrating its robustness to the number of demonstrations $k$.
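The size of the margin can be read directly from the Avg. columns of Tables 9 and 10. The short script below is a convenience sketch that recomputes the gap between LTV and the strongest task vector baseline at each $k$; the dictionary layout and variable names are choices made here, while the numbers are copied from the two tables.

```python
# Average accuracies (%) copied from Tables 9 and 10 (LLaMA-3.1-8B).
avg_acc = {
    10: {"Function Vector": 66.34, "Task Vector": 69.95,
         "State Vector": 70.45, "I2CL": 66.13, "LTV": 74.89},
    50: {"Function Vector": 66.66, "Task Vector": 69.35,
         "State Vector": 69.56, "I2CL": 66.53, "LTV": 77.12},
}

margins = {}
for k, scores in avg_acc.items():
    # Exclude LTV itself, then pick the baseline with the highest average.
    baselines = {m: a for m, a in scores.items() if m != "LTV"}
    best = max(baselines, key=baselines.get)
    margins[k] = (best, round(scores["LTV"] - baselines[best], 2))
    print(f"k={k}: LTV leads {best} by {margins[k][1]} points")
```

In both settings the strongest baseline is State Vector, and the gap is 4.44 points at $k=10$ and 7.56 points at $k=50$.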