Toward Robust In-Context Learning: Leveraging Out-of-distribution Proxies for Target Inaccessible Demonstration Retrieval
Summary
This paper introduces DOPA, a demonstration search framework that uses an out-of-distribution proxy to retrieve robust demonstrations for LLMs when the target domain is inaccessible, enhancing in-context learning performance under distribution shift.
# Toward Robust In-Context Learning: Leveraging Out-of-distribution Proxies for Target Inaccessible Demonstration Retrieval
Source: [https://arxiv.org/html/2606.00014](https://arxiv.org/html/2606.00014)
Hao Xu¹, Rite Bo¹, Fausto Giunchiglia¹˒², Yingji Li¹, Rui Song¹ (corresponding author)

¹ College of Computer Science and Technology, Jilin University, China
² Department of Information Engineering and Computer Science, University of Trento, Italy

{xuhao,yingjili,songrui}@jlu.edu.cn, bort24@mails.jlu.edu.cn, fausto.giunchiglia@unitn.it
###### Abstract
Although studies have demonstrated that Large Language Models (LLMs) can perform well on Out-of-Distribution (OOD) tasks, their advantage tends to diminish as the distribution shift becomes more severe. Consequently, researchers aim to retrieve distributionally similar and informative demonstrations from the available source domain to boost the inference capabilities of LLMs. However, in practical scenarios where the target domain is inaccessible, the unknown target distribution is hard to evaluate, which in turn lowers the quality of the selected demonstrations. To address this problem, we propose DOPA, a demonstration search framework that incorporates an OOD proxy to approximate the inaccessible target domain and guide the retrieval process. Building on proxy-based evaluation, DOPA further introduces a Mahalanobis distance-based global diversity constraint to ensure sufficient diversity among the retrieved demonstrations. Experimental results on multiple LLMs and tasks demonstrate that DOPA effectively enhances robustness in OOD settings (code: https://github.com/bort64/ood_code).
## 1 Introduction
Large language models (LLMs) have achieved strong performance across diverse NLP tasks Chang et al. ([2024](https://arxiv.org/html/2606.00014#bib.bib1)); Song et al. ([2025](https://arxiv.org/html/2606.00014#bib.bib2)), with in-context learning (ICL) emerging as a widely used prompting paradigm Min et al. ([2022](https://arxiv.org/html/2606.00014#bib.bib3)). By providing a small set of demonstrations, ICL can effectively guide model reasoning and prediction. However, recent studies show that LLM performance degrades markedly under out-of-distribution (OOD) settings Yuan et al. ([2023](https://arxiv.org/html/2606.00014#bib.bib18)); Wang et al. ([2025](https://arxiv.org/html/2606.00014#bib.bib28)), particularly when demonstrations are distributionally mismatched with the target domain, motivating research into more robust demonstration selection strategies.
Retrieval Luo et al. ([2024](https://arxiv.org/html/2606.00014#bib.bib4)) and augmentation Shu et al. ([2024](https://arxiv.org/html/2606.00014#bib.bib5)) are two commonly used approaches for obtaining effective samples. The former searches for the most relevant examples within a specific domain, while the latter rewrites existing samples to reduce their discrepancy with the target instance. Demonstration retrieval relies on a retriever. Some off-the-shelf metrics, such as BM25 Agrawal et al. ([2023](https://arxiv.org/html/2606.00014#bib.bib6)), sentence encoder-based similarity Liu et al. ([2022](https://arxiv.org/html/2606.00014#bib.bib7)), model influence Peng et al. ([2024](https://arxiv.org/html/2606.00014#bib.bib14)); S. et al. ([2024](https://arxiv.org/html/2606.00014#bib.bib9)), and misconfidence Xu and Zhang ([2024](https://arxiv.org/html/2606.00014#bib.bib8)), can support general-purpose retrieval strategies. Other approaches train a dense retriever to obtain more task-relevant retrieval results Cheng et al. ([2023](https://arxiv.org/html/2606.00014#bib.bib10)); Li et al. ([2023a](https://arxiv.org/html/2606.00014#bib.bib11)). Augmentation, on the other hand, focuses on adapting existing samples to better match the distributional characteristics of the target instance O’Brien et al. ([2024](https://arxiv.org/html/2606.00014#bib.bib12)); Madine ([2024](https://arxiv.org/html/2606.00014#bib.bib13)). However, in real-world applications, an inaccessible target domain hinders the ability to obtain domain-aligned demonstrations, often resulting in degraded performance Song et al. ([2024a](https://arxiv.org/html/2606.00014#bib.bib15)).
To address this challenge, we propose a demonstration optimization framework based on OOD proxy assessment (termed DOPA). This framework quantifies the utility of source-domain samples in the absence of target-domain access, and uses the resulting scores to guide demonstration retrieval. At its core, DOPA introduces an OOD proxy as a principled approximation to the unknown target distribution Zhang and Wischik ([2022](https://arxiv.org/html/2606.00014#bib.bib16)). The OOD proxy is composed of two components: a source proxy and a target proxy. The source proxy is an LLM instruction-tuned on the source domain so that it fully adapts to the source distribution, while the target proxy is the original, unmodified version of the same LLM. The perplexity ratio between their predictions on identical input samples is adopted as the OOD score for those samples Nalisnick et al. ([2019](https://arxiv.org/html/2606.00014#bib.bib17)). This OOD score estimates how familiar source-domain samples are to the target domain in the absence of target-domain information. It is further combined with representational similarity to the sample being predicted for candidate selection. The validity of the OOD score is theoretically supported through a bounded proxy error analysis. Moreover, to enhance the diversity of retrieved demonstrations, we incorporate a Mahalanobis distance-based search strategy into the retrieval process. By relying on the OOD proxy, DOPA can identify informative demonstrations solely within the source domain, without requiring any samples from the target domain. Extensive experiments show that DOPA consistently outperforms baseline approaches across diverse LLMs and natural language understanding tasks.
In addition, we provide a multi-dimensional analysis that demonstrates the effectiveness of the proxy in selecting samples that exhibit behavioral similarity to those in the target domain. Our contributions are as follows:

(i) We propose a method that leverages OOD proxies to extract distribution-aligned samples, and we theoretically demonstrate the soundness of the proxy through a bounded proxy error guarantee.

(ii) We propose a target-agnostic demonstration retrieval framework based on OOD proxies, which combines proxy results and contextual diversity to enhance the quality of demonstration selection.

(iii) Experimental results on multiple NLP tasks across various LLMs demonstrate that DOPA effectively enhances OOD robustness in ICL.
## 2 Related Work
Demonstration Retrieval. Despite the impressive performance demonstrated by ICL, an increasing number of studies have shown its sensitivity to the choice of demonstrations Song et al. ([2024b](https://arxiv.org/html/2606.00014#bib.bib23)). To obtain more effective demonstrations, a natural idea is to search over candidate samples within a constrained space Luo et al. ([2024](https://arxiv.org/html/2606.00014#bib.bib4)). Depending on whether the retrieval tool has been trained, demonstration search can be divided into off-the-shelf retrieval and retrieval based on fine-tuned models. Term-based similarity has been widely used for demonstration retrieval, with BM25 being one of the most popular scoring metrics Agrawal et al. ([2023](https://arxiv.org/html/2606.00014#bib.bib6)); Ye et al. ([2023](https://arxiv.org/html/2606.00014#bib.bib24)). In addition, several sentence embedding models, such as SBERT Wang et al. ([2024](https://arxiv.org/html/2606.00014#bib.bib25)), RoBERTa Liu et al. ([2022](https://arxiv.org/html/2606.00014#bib.bib7)), and SimCSE Gao et al. ([2021](https://arxiv.org/html/2606.00014#bib.bib26)), have also been widely used to compute inter-sample similarity and optimize demonstration selection. Moreover, some approaches assess the influence of individual samples on model predictions to select high-impact examples for demonstrations Peng et al. ([2024](https://arxiv.org/html/2606.00014#bib.bib14)); S. et al. ([2024](https://arxiv.org/html/2606.00014#bib.bib9)). Off-the-shelf retrieval methods may yield suboptimal results, as they do not incorporate task-specific information.
Therefore, some methods leverage feedback signals from LLMs to distinguish between important and unimportant samples, and further optimize the retriever for specific tasks using objectives such as ranking Li et al. ([2023a](https://arxiv.org/html/2606.00014#bib.bib11)), contrastive learning Cheng et al. ([2023](https://arxiv.org/html/2606.00014#bib.bib10)); Luo et al. ([2023](https://arxiv.org/html/2606.00014#bib.bib27)), and diversity Ye et al. ([2023](https://arxiv.org/html/2606.00014#bib.bib24)). However, because these methods rely on feedback from LLMs, they incur a higher computational cost.
OOD Robustness in ICL. In ICL settings, distribution shifts often cause substantial performance degradation, exposing models’ sensitivity and limited robustness to unseen domains Yuan et al. ([2023](https://arxiv.org/html/2606.00014#bib.bib18)); Wang et al. ([2025](https://arxiv.org/html/2606.00014#bib.bib28)). Such distributional gaps can undermine demonstration retrieval, as retrieved examples may no longer align semantically with the target task. To address this issue, prior work has explored demonstration augmentation, including incorporating external knowledge such as linguistic rules Jiang et al. ([2024](https://arxiv.org/html/2606.00014#bib.bib29)) or human feedback Bai et al. ([2024](https://arxiv.org/html/2606.00014#bib.bib30)) for fine-tuning LLMs. However, the necessity of fine-tuning remains debated, with some studies suggesting that LLMs inherently possess the capacity to handle OOD data Uppaal et al. ([2023](https://arxiv.org/html/2606.00014#bib.bib31)); Zhang et al. ([2024](https://arxiv.org/html/2606.00014#bib.bib20)). Motivated by this perspective, semantic rewriting has emerged as an alternative, prompting LLMs to adapt source samples to better align with the target domain O’Brien et al. ([2024](https://arxiv.org/html/2606.00014#bib.bib12)); Madine ([2024](https://arxiv.org/html/2606.00014#bib.bib13)).
## 3 Method
Figure 1: The model architecture of DOPA, illustrated on the sentiment analysis task. First, DOPA performs task-specific instruction tuning on the source domain to obtain a source proxy based on any given LLM. Correspondingly, an identical LLM without fine-tuning, which preserves the prior knowledge of the target domain, is employed as the target-domain proxy. For the same input, the ratio between the two proxies is employed as an OOD proxy estimation, which is further combined with similarity and diversity to support multi-granularity demonstration search.

### 3.1 Task Definitions and Model Description
Our model description begins with some definitions. In the OOD setting, an LLM $\mathcal{M}$ is restricted to using data from the source domain $\mathcal{D}_S$ to perform ICL, and is expected to make predictions on any sample $x_t$ from the target domain $\mathcal{D}_T$ as accurately as possible. During inference, all samples from $\mathcal{D}_T$ other than $x_t$ are strictly inaccessible, preventing the model from making decisions by referencing samples from a similar distribution. For ICL, a prompt $\mathcal{P}_{x_t}$ is constructed by selecting $N \times |Y|$ labeled examples $\{(x^{(j)}, y^{(j)})\}_{j=1}^{N \times |Y|}$ from $\mathcal{D}_S$, which are then concatenated with $x_t$ and fed into $\mathcal{M}$. Here, $|Y|$ denotes the size of the label space. The LLM then produces a prediction $\hat{y}_t = \mathcal{M}(\mathcal{P}_{x_t})$. Depending on the task, $\hat{y}_t$ can take various forms. For classification tasks, it typically corresponds to a token representing a label category (e.g., positive or negative), while for generative tasks, it may be a string representing the desired output. As illustrated in Figure [1](https://arxiv.org/html/2606.00014#S3.F1), DOPA comprises two main components: OOD proxy estimation and multi-granularity demonstration retrieval. The proxy estimation module assesses the proximity of source domain samples to the target domain using an OOD proxy, while the demonstration retrieval module selects appropriate examples by jointly optimizing semantic similarity and diversity constraints.
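The prompt construction described above can be sketched as follows. This is a minimal illustration only: the `Input:`/`Label:` template, the function name `build_icl_prompt`, and the example sentences are assumptions made for this sketch, whereas the paper uses the instruction templates from the BOSS benchmark.

```python
from typing import List, Tuple


def build_icl_prompt(instruction: str,
                     demos: List[Tuple[str, str]],
                     x_t: str) -> str:
    """Concatenate an instruction, labeled demonstrations, and the test input.

    The "Input:/Label:" layout is a hypothetical template, not the BOSS one.
    """
    parts = [instruction]
    for x, y in demos:
        parts.append(f"Input: {x}\nLabel: {y}")
    # The test sample comes last, with an empty label slot for the LLM to fill.
    parts.append(f"Input: {x_t}\nLabel:")
    return "\n\n".join(parts)


# N = 1 demonstration per label and |Y| = 2 labels give N * |Y| = 2 demonstrations.
demos = [
    ("The battery lasts all day.", "positive"),
    ("The screen cracked within a week.", "negative"),
]
prompt = build_icl_prompt(
    "Classify the sentiment of the input as positive or negative.",
    demos,
    "Shipping was slow but the product works.",
)
print(prompt)
```

The string returned here plays the role of $\mathcal{P}_{x_t}$; passing it to the LLM would yield the prediction $\hat{y}_t$.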
### 3.2 OOD Proxy Estimation
The goal of OOD proxy estimation is to evaluate the utility of source domain samples in order to select those that are more aligned with the target domain. Without access to the target domain, however, it is difficult to accurately assess the target distribution. Therefore, inspired by prior work on OOD detection Ren et al. ([2019](https://arxiv.org/html/2606.00014#bib.bib19)); Zhang and Wischik ([2022](https://arxiv.org/html/2606.00014#bib.bib16)), we construct an OOD proxy to approximate the target domain distribution, and compute the OOD score of any sample via the proxy. The OOD score is then used to guide sample selection from $\mathcal{D}_S$.
Proxy Construction. The OOD proxy consists of two components: the source proxy and the target proxy, which ideally model the source and target distributions, respectively. For the former, an intuitive approach is to instruction-tune LLMs on the source domain so that the model can better adapt to the source distribution. In DOPA, the instructions for the source proxy are encapsulated in the same format as in ICL, aiming to prompt LLMs to produce reasonable task-related predictions. Formally, for easily accessible source domain data $x_s \in \mathcal{D}_S$, DOPA optimizes the LLM through the following cross-entropy supervised loss:
$$\mathcal{L}_{sft} = -\sum_{j=1}^{T} \log p_{\mathcal{M}}\left(y_{i,j} \mid \mathcal{P}_{x_s}, y_{i,<j}\right), \tag{1}$$

where $\mathcal{P}_{x_s}$ is a task-related prompt that does not contain any demonstrations.
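As a small numeric illustration of Eq. 1, the loss for one training example is the negative sum of the log-probabilities that the model assigns to each gold target token. The sketch below assumes those per-token probabilities are already available; the function name `sft_loss` and the probability values are hypothetical and not taken from the paper.

```python
import math
from typing import Sequence


def sft_loss(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood summed over the T target tokens (Eq. 1).

    token_probs[j] stands for the probability the model assigns to the
    j-th gold target token given the prompt and the preceding target tokens.
    """
    return -sum(math.log(p) for p in token_probs)


# Hypothetical probabilities for a three-token target sequence.
loss = sft_loss([0.9, 0.8, 0.95])
print(round(loss, 4))
```

A model that assigns probability 1 to every gold token would reach a loss of 0, which is the sense in which the source proxy "fully adapts" to the source distribution as training proceeds.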
Since the target domain distribution is unknown, some methods adopt a general approach that replaces the target-domain proxy with a uniform distribution Bishop ([1993](https://arxiv.org/html/2606.00014#bib.bib36)); Nalisnick et al. ([2019](https://arxiv.org/html/2606.00014#bib.bib17)). Such a strong assumption tends to yield suboptimal results, as shown in Lemma [1](https://arxiv.org/html/2606.00014#Thmlemma1). Given that LLMs are pretrained on extensive corpora, it is reasonable to assume that they implicitly encode a broad spectrum of linguistic and factual knowledge. As such, LLMs can act as weak proxies for the target distribution, particularly in few-shot settings Zhang et al. ([2024](https://arxiv.org/html/2606.00014#bib.bib20)).
Sample Screening based on OOD Score. Given the aforementioned proxies, DOPA further assumes that if a sample exhibits divergent behavior under these two proxies, it may suggest an inherent bias or a stronger alignment toward one specific domain. This facilitates domain discrimination in the absence of any auxiliary target domain samples. In previous research, the likelihood ratio is one of the most commonly used criteria for detecting such divergent behavior Ren et al. ([2019](https://arxiv.org/html/2606.00014#bib.bib19)); Zhang and Wischik ([2022](https://arxiv.org/html/2606.00014#bib.bib16)):
$$S(x) = \frac{P_{target}(x)}{P_{source}(x)} \approx \underbrace{\frac{P_{\text{target}}^{\text{proxy}}(x)}{P_{\text{source}}^{\text{proxy}}(x)}}_{\text{OOD proxy}}, \tag{2}$$

where $P_{source}(x)$ and $P_{target}(x)$ represent the behavior of models with the source and target domain distributions when given the same input sample $x$, respectively. Under distributional uncertainty, the OOD proxy is used as their approximation. To support the validity of OOD proxy estimation, we further establish a theoretical guarantee on the boundedness of proxy error under mild assumptions.
###### Theorem 1 (Proxy Error Bound).
Let $P_{\mathrm{target}}$ and $P_{\mathrm{source}}$ be the true probability distributions of the target and source domains, and let $P_{\mathrm{target}}^{\mathrm{proxy}}$ and $P_{\mathrm{source}}^{\mathrm{proxy}}$ be the corresponding proxy distributions. Suppose there exist constants $\varepsilon_t \geq 0$, $\varepsilon_s \geq 0$, $m_t > 0$, and $m_s > 0$ such that the following hold:

- The Kullback-Leibler divergences are bounded: $D_{\mathrm{KL}}(P_{\mathrm{target}} \parallel P_{\mathrm{target}}^{\mathrm{proxy}}) \leq \varepsilon_t$, $D_{\mathrm{KL}}(P_{\mathrm{source}} \parallel P_{\mathrm{source}}^{\mathrm{proxy}}) \leq \varepsilon_s$.
- The proxy distributions have pointwise lower bounds: $P_{\mathrm{target}}(x), P_{\mathrm{target}}^{\mathrm{proxy}}(x) \geq m_t$, $\quad P_{\mathrm{source}}(x), P_{\mathrm{source}}^{\mathrm{proxy}}(x) \geq m_s$.

Then, for all $x$, the error in the log-likelihood ratio satisfies:

$$\left| \log\frac{P_{\mathrm{target}}(x)}{P_{\mathrm{source}}(x)} - \log\frac{P_{\mathrm{target}}^{\mathrm{proxy}}(x)}{P_{\mathrm{source}}^{\mathrm{proxy}}(x)} \right| \leq \frac{\varepsilon_t}{m_t} + \frac{\varepsilon_s}{m_s}.$$
###### Lemma 1 (Error Bound with Uniform Proxy).
Building upon Theorem [1](https://arxiv.org/html/2606.00014#Thmtheorem1), if a uniform distribution is used as the proxy for the target domain, it is more likely to result in a looser upper bound on the error.
Due to space limitations, the proof of the above theorem is provided in Appendix [A](https://arxiv.org/html/2606.00014#A1). Such a uniform bound certifies that the proxy-based score deviates from the true likelihood ratio by at most a known quantity, thereby providing theoretical assurance for reliable estimation. In the case of autoregressive LLMs, perplexity Wuhrmann et al. ([2025](https://arxiv.org/html/2606.00014#bib.bib35)) is commonly employed to quantify the model’s familiarity with a given text $x$: $PPL(x) = \exp\big(-\frac{1}{m}\sum_{i=1}^{m} \log P(w_i \mid w_{<i})\big)$, where $m$ is the total number of tokens in $x$, and $P(w_i \mid w_{<i})$ is the conditional probability of the language model predicting the $i$-th token. Therefore, to conform to the $\log$ form as stated in Theorem [1](https://arxiv.org/html/2606.00014#Thmtheorem1), we adopt the log-perplexity difference as a more stable alternative:
$$S(x) = \log PPL_{target}^{proxy}(x) - \log PPL_{source}^{proxy}(x). \tag{3}$$

Ideally, if the value of $S(x)$ is relatively low, the sample exhibits higher perplexity under the source-domain proxy and lower perplexity under the target-domain proxy, which indicates that the sample is more aligned with the target domain and should therefore be prioritized for constructing demonstrations. By performing a single pass over $\mathcal{D}_S$, we can obtain a potential subset $\hat{\mathcal{D}}_S$ that is closer to the target domain distribution by selecting the $k$ samples with the lowest OOD scores.
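The scoring and screening step can be sketched as follows, assuming the per-token log-probabilities of each source sample under the two proxies have already been obtained from the models. The helper names (`log_perplexity`, `ood_score`, `select_lowest_k`) and the log-probability values are hypothetical and serve only to illustrate Eq. 3 and the lowest-$k$ selection.

```python
from typing import Dict, List, Sequence


def log_perplexity(token_logprobs: Sequence[float]) -> float:
    """log PPL(x) = -(1/m) * sum_i log P(w_i | w_<i)."""
    return -sum(token_logprobs) / len(token_logprobs)


def ood_score(target_logprobs: Sequence[float],
              source_logprobs: Sequence[float]) -> float:
    """Eq. 3: log-perplexity under the target proxy minus that under the source proxy."""
    return log_perplexity(target_logprobs) - log_perplexity(source_logprobs)


def select_lowest_k(scores: Dict[str, float], k: int) -> List[str]:
    """Keep the k sample ids with the lowest OOD scores."""
    return sorted(scores, key=scores.get)[:k]


# Hypothetical (target-proxy, source-proxy) token log-probabilities per sample.
samples = {
    "a": ([-1.0, -1.0], [-2.0, -2.0]),  # easier for the target proxy
    "b": ([-2.0, -2.0], [-0.5, -0.5]),  # easier for the source proxy
    "c": ([-1.5, -1.5], [-1.5, -1.5]),  # equally likely under both
}
scores = {name: ood_score(t, s) for name, (t, s) in samples.items()}
subset = select_lowest_k(scores, k=2)
print(scores, subset)
```

In this toy setting sample `a` receives the lowest score because the unmodified model finds it less surprising than the source-tuned model does, so it is kept first; sample `b` is filtered out.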
### 3.3 Demonstration Retrieval
Although the OOD scores help identify source domain samples that are more likely to align with the target domain, the resulting coarse-grained subset still requires further refinement to construct effective demonstrations. Existing studies have provided strong support for the demonstration search process Liu et al. ([2022](https://arxiv.org/html/2606.00014#bib.bib7)); Agrawal et al. ([2023](https://arxiv.org/html/2606.00014#bib.bib6)). A general approach is to adopt an off-the-shelf text representation model to encode candidate texts into vectors and rank the most relevant demonstrations based on their cosine similarity with the test sample. However, one limitation of proxy-based OOD scoring lies in its reliance on language model perplexity, which primarily captures token-level fluency and distributional similarity. As a result, it may implicitly favor shorter texts or those conforming to high-frequency linguistic patterns Holtzman et al. ([2020](https://arxiv.org/html/2606.00014#bib.bib21)), leading to reduced diversity in the selected sample pool and potentially impairing the quality of the retrieved demonstrations. To address this issue, we further introduce a global diversity constraint to improve the overall quality of the retrieved demonstrations. Specifically, for each sample representation $h_{x_i}$ corresponding to the proxy-filtered set $\hat{\mathcal{D}}_S$, we initialize a candidate sample set $\mathcal{D}_{demo}$ based on the similarity of the representations to the test-sample representation $h_t$:
$$\mathcal{D}_{demo} = \arg\max_{C} \{ sim(h_{x_i}, h_t) \}_{i=1}^{|\hat{\mathcal{D}}_S|}, \tag{4}$$

where $C$ is the number of samples in the initialized candidates (the value of $C$ is specified in the detailed experimental settings). Subsequently, the mean pairwise Mahalanobis distance Li et al. ([2023b](https://arxiv.org/html/2606.00014#bib.bib22)) among samples in $\mathcal{D}_{demo}$ is used to quantify the diversity:
$$Div = \frac{2}{|\mathcal{D}_{demo}|(|\mathcal{D}_{demo}|-1)} \sum_{i<j} \sqrt{\mathbf{D}_{ij}^{\top} \Sigma^{-1} \mathbf{D}_{ij}}, \tag{5}$$

where $\mathbf{D}_{ij} = h_{x_i} - h_{x_j}$ and $\Sigma$ is the empirical covariance matrix computed over all samples. The Mahalanobis distance is adopted because it accounts for the correlations between samples while measuring diversity, which helps impose constraints on similarity-based retrieval results. If a new sample $\hat{x} \in \hat{\mathcal{D}}_S \setminus \mathcal{D}_{demo}$ does not lead to a decrease in overall diversity, i.e., $Div_{\mathcal{D}_{demo}} \leq Div_{\{\hat{x}\} \cup \mathcal{D}_{demo}}$, it is retained. This process continues until the number of samples meets the required threshold for constructing demonstrations. The above procedure is summarized in Algorithm [1](https://arxiv.org/html/2606.00014#alg1). After obtaining sufficient samples, we construct demonstrations in a fixed label order to prevent bias introduced by ordering, and use them for ICL.
Algorithm 1: Demonstration Retrieval Process of DOPA

Input: Proxy-filtered set $\hat{\mathcal{D}}_S$, test sample $x_t$.

Parameter: Demonstration quantity $N \times |Y|$, size $C$ of the initialized candidate set.

Output: Final demonstration set $\mathcal{D}_{demo}$ with size $N \times |Y|$.

1. Initialize $\mathcal{D}_{demo}$ by Eq. [4](https://arxiv.org/html/2606.00014#S3.E4); sort $\hat{\mathcal{D}}_S$ in descending order according to $sim(h_{x_i}, h_t)$; counter $\leftarrow 0$.
2. while $|\mathcal{D}_{demo}| < N \times |Y|$ do
3. $\hat{x} \leftarrow \hat{\mathcal{D}}_S[C + \text{counter}]$.
4. if $Div_{\mathcal{D}_{demo}} \leq Div_{\{\hat{x}\} \cup \mathcal{D}_{demo}}$ then
5. $\mathcal{D}_{demo} \leftarrow \{\hat{x}\} \cup \mathcal{D}_{demo}$.
6. end if
7. counter $\leftarrow$ counter + 1.
8. end while
9. return $\mathcal{D}_{demo}$
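The retrieval loop of Algorithm 1 can be sketched in NumPy as below. Several choices are assumptions of this sketch rather than details confirmed by the paper: candidates are ranked from most to least similar so that the first $C$ entries match Eq. 4, cosine similarity is used for $sim$, the covariance $\Sigma$ is estimated over the proxy-filtered representations and inverted with a pseudo-inverse for numerical safety, $C \geq 2$ so that a pairwise distance exists, and the loop simply stops if the candidates run out. The embeddings are random placeholders.

```python
import numpy as np


def mean_pairwise_mahalanobis(H: np.ndarray, cov_inv: np.ndarray) -> float:
    """Eq. 5: mean Mahalanobis distance over all unordered pairs of rows of H."""
    n = len(H)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = H[i] - H[j]
            # Clamp at zero to guard against tiny negative round-off values.
            total += float(np.sqrt(max(float(d @ cov_inv @ d), 0.0)))
    return 2.0 * total / (n * (n - 1))


def retrieve(H: np.ndarray, h_t: np.ndarray, C: int, budget: int) -> list:
    """Greedy similarity-then-diversity retrieval, loosely following Algorithm 1.

    H holds one representation per sample in the proxy-filtered set,
    h_t is the test-sample representation, budget stands for N * |Y|.
    """
    sims = (H @ h_t) / (np.linalg.norm(H, axis=1) * np.linalg.norm(h_t))
    order = [int(i) for i in np.argsort(-sims)]        # most similar first
    cov_inv = np.linalg.pinv(np.cov(H, rowvar=False))   # assumed estimate of Sigma^-1
    demo = order[:C]                                    # Eq. 4: top-C initialization
    for idx in order[C:]:
        if len(demo) >= budget:
            break
        before = mean_pairwise_mahalanobis(H[demo], cov_inv)
        after = mean_pairwise_mahalanobis(H[demo + [idx]], cov_inv)
        if before <= after:                             # diversity did not decrease
            demo.append(idx)
    return demo


rng = np.random.default_rng(0)
H = rng.normal(size=(20, 4))    # placeholder embeddings for 20 filtered samples
h_t = rng.normal(size=4)        # placeholder test-sample embedding
demo = retrieve(H, h_t, C=2, budget=4)
print(demo)
```

Because a candidate is accepted only when the mean pairwise distance does not drop, near-duplicates of already selected demonstrations tend to be skipped even when they are highly similar to the test sample.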
| Model | Method | SA: dynasent | SA: semeval | SA: sst | SA: avg | TD: implicit | TD: adv | TD: toxigen | TD: avg | NLI: wanli | NLI: anli | NLI: cnli | NLI: avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT2-xl | Random | 36.33 | 49.28 | 47.70 | 44.44 | 50.47 | 50.20 | 50.60 | 50.42 | 34.23 | 32.23 | 39.22 | 35.23 |
| GPT2-xl | KNN | 35.89 | 45.92 | 51.17 | 44.33 | 47.67 | 47.50 | 48.50 | 47.89 | 33.57 | 33.50 | 45.05 | 37.37 |
| GPT2-xl | DrICL | 37.00 | 47.66 | 52.76 | 45.81 | 49.70 | 51.38 | 48.23 | 49.77 | 32.03 | 33.33 | 47.20 | 37.52 |
| GPT2-xl | Rewrite | 36.00 | 45.84 | 50.61 | 44.15 | 46.90 | 45.72 | 49.37 | 47.33 | 34.00 | 32.77 | 45.38 | 37.38 |
| GPT2-xl | InfICL | 36.61 | 49.26 | 46.95 | 44.27 | 49.90 | 50.33 | 50.17 | 50.13 | 33.80 | 32.40 | 24.25 | 30.15 |
| GPT2-xl | DICL | 38.46 | 49.86 | 56.92 | 48.41 | 46.20 | 47.37 | 46.37 | 46.65 | 32.93 | 33.47 | 44.57 | 36.99 |
| GPT2-xl | DOPA\* | 38.23 | 48.44 | 59.61 | 48.76 | 51.67 | 53.29 | 50.50 | 51.82 | 34.93 | 33.43 | 45.77 | 38.04 |
| LLaMA3.2-3B | Random | 53.81 | 47.86 | 66.26 | 55.98 | 57.70 | 55.20 | 65.70 | 59.53 | 37.70 | 35.50 | 42.75 | 38.65 |
| LLaMA3.2-3B | KNN | 52.63 | 45.76 | 65.42 | 54.60 | 56.63 | 53.29 | 51.03 | 53.65 | 37.20 | 34.67 | 44.24 | 38.70 |
| LLaMA3.2-3B | DrICL | 56.05 | 46.08 | 67.10 | 56.41 | 57.83 | 56.18 | 64.80 | 59.61 | 36.50 | 33.87 | 42.95 | 37.77 |
| LLaMA3.2-3B | Rewrite | 53.92 | 45.06 | 64.29 | 54.43 | 51.57 | 57.04 | 62.70 | 57.10 | 36.43 | 35.60 | 42.75 | 38.26 |
| LLaMA3.2-3B | InfICL | 53.35 | 46.80 | 64.39 | 54.84 | 56.33 | 55.20 | 65.57 | 59.03 | 36.23 | 36.00 | 40.65 | 37.63 |
| LLaMA3.2-3B | DICL | 53.79 | 47.50 | 68.42 | 56.57 | 56.60 | 53.88 | 66.00 | 58.83 | 37.67 | 34.60 | 42.52 | 38.26 |
| LLaMA3.2-3B | DOPA\* | 55.71 | 53.28 | 68.88 | 59.29 | 57.87 | 56.45 | 65.30 | 59.87 | 38.40 | 35.87 | 43.19 | 39.15 |
| Gemma2-2B | Random | 56.47 | 47.06 | 66.45 | 56.66 | 55.57 | 56.51 | 63.93 | 58.67 | 33.37 | 33.00 | 42.66 | 36.34 |
| Gemma2-2B | KNN | 55.29 | 47.28 | 66.26 | 56.28 | 53.20 | 47.89 | 63.87 | 54.99 | 33.50 | 32.93 | 41.61 | 36.01 |
| Gemma2-2B | DrICL | 57.67 | 47.20 | 67.10 | 57.32 | 55.43 | 56.45 | 61.17 | 57.68 | 33.73 | 33.57 | 45.43 | 37.58 |
| Gemma2-2B | Rewrite | 57.91 | 47.12 | 67.01 | 57.35 | 48.57 | 51.84 | 61.84 | 54.08 | 33.70 | 33.33 | 45.29 | 37.44 |
| Gemma2-2B | InfICL | 58.07 | 45.38 | 64.57 | 56.01 | 55.63 | 57.50 | 59.90 | 57.68 | 33.27 | 32.93 | 45.29 | 37.16 |
| Gemma2-2B | DICL | 55.61 | 48.66 | 68.10 | 57.46 | 54.40 | 53.68 | 65.30 | 57.79 | 33.43 | 33.33 | 45.15 | 37.30 |
| Gemma2-2B | DOPA\* | 57.24 | 47.70 | 68.13 | 57.69 | 56.53 | 58.09 | 65.73 | 60.12 | 33.37 | 33.07 | 46.10 | 37.51 |
| Qwen3-1.7B | Random | 62.82 | 60.90 | 69.17 | 64.29 | 54.97 | 52.89 | 67.20 | 58.35 | 41.30 | 35.17 | 39.12 | 38.53 |
| Qwen3-1.7B | KNN | 60.75 | 58.32 | 70.67 | 63.25 | 56.10 | 50.13 | 65.73 | 57.32 | 41.77 | 35.20 | 37.83 | 38.27 |
| Qwen3-1.7B | DrICL | 61.54 | 60.16 | 70.38 | 64.03 | 54.07 | 55.86 | 66.37 | 58.76 | 42.77 | 35.33 | 37.49 | 38.53 |
| Qwen3-1.7B | Rewrite | 60.38 | 54.72 | 70.29 | 61.80 | 51.33 | 57.50 | 61.97 | 56.93 | 39.47 | 36.63 | 38.79 | 38.30 |
| Qwen3-1.7B | InfICL | 62.10 | 61.12 | 69.92 | 64.38 | 55.30 | 56.83 | 65.23 | 59.12 | 40.80 | 35.57 | 40.03 | 38.80 |
| Qwen3-1.7B | DICL | 59.94 | 57.00 | 69.82 | 62.26 | 55.37 | 54.54 | 64.45 | 58.12 | 42.70 | 35.83 | 39.22 | 39.25 |
| Qwen3-1.7B | DOPA\* | 63.35 | 59.64 | 71.79 | 64.93 | 55.47 | 56.45 | 65.73 | 59.22 | 42.37 | 36.47 | 40.94 | 39.93 |
| LLaMA3.1-8B | Random | 58.46 | 52.02 | 69.63 | 60.04 | 56.50 | 55.66 | 66.40 | 59.52 | 39.73 | 36.67 | 43.09 | 39.83 |
| LLaMA3.1-8B | KNN | 57.61 | 49.32 | 70.48 | 59.13 | 57.23 | 54.80 | 65.50 | 59.18 | 41.00 | 37.40 | 42.23 | 40.21 |
| LLaMA3.1-8B | DrICL | 59.20 | 51.22 | 69.73 | 60.05 | 58.57 | 56.64 | 67.73 | 60.98 | 40.70 | 35.90 | 42.32 | 39.64 |
| LLaMA3.1-8B | Rewrite | 56.59 | 48.74 | 69.07 | 58.13 | 54.57 | 59.67 | 63.87 | 59.37 | 38.87 | 36.00 | 43.23 | 39.37 |
| LLaMA3.1-8B | InfICL | 57.63 | 52.92 | 70.76 | 60.44 | 59.00 | 59.74 | 63.77 | 60.83 | 40.87 | 37.23 | 43.04 | 40.38 |
| LLaMA3.1-8B | DICL | 58.35 | 51.44 | 70.10 | 59.96 | 57.03 | 55.99 | 65.97 | 59.66 | 38.97 | 36.37 | 42.13 | 39.16 |
| LLaMA3.1-8B | DOPA\* | 59.25 | 51.84 | 72.16 | 61.08 | 58.60 | 59.28 | 67.43 | 61.77 | 41.33 | 37.53 | 42.75 | 40.54 |
| Qwen3-8B | Random | 68.56 | 63.42 | 76.85 | 69.61 | 56.90 | 56.71 | 65.63 | 59.75 | 41.13 | 35.77 | 28.17 | 35.02 |
| Qwen3-8B | KNN | 65.34 | 63.78 | 76.01 | 68.38 | 57.17 | 58.03 | 68.07 | 61.09 | 41.93 | 34.93 | 30.27 | 35.71 |
| Qwen3-8B | DrICL | 68.40 | 63.82 | 78.26 | 70.16 | 63.10 | 57.04 | 77.43 | 65.86 | 40.57 | 36.10 | 32.81 | 36.49 |
| Qwen3-8B | Rewrite | 64.25 | 60.08 | 74.98 | 66.44 | 55.63 | 59.41 | 70.50 | 61.85 | 38.83 | 34.63 | 31.09 | 34.85 |
| Qwen3-8B | InfICL | 70.71 | 64.04 | 75.35 | 70.03 | 56.33 | 57.96 | 65.97 | 60.09 | 40.07 | 35.60 | 35.29 | 36.99 |
| Qwen3-8B | DICL | 64.32 | 63.46 | 76.10 | 67.96 | 60.80 | 59.21 | 77.53 | 65.85 | 41.60 | 35.33 | 29.60 | 35.51 |
| Qwen3-8B | DOPA\* | 70.90 | 63.32 | 78.35 | 70.86 | 62.87 | 59.08 | 79.00 | 66.98 | 45.40 | 38.10 | 34.29 | 39.26 |

Table 1: The performance (accuracy %) on classification tasks. \* indicates that the results based on the LLM among all the datasets are statistically significant under the Wilcoxon Signed-Rank Test ($p \leq 0.05$). The calculation process of significance is presented in Appendix [B.3](https://arxiv.org/html/2606.00014#A2.SS3).
## 4 Experiment
### 4.1 Experimental Setup
We conduct experiments on the OOD-specific benchmark BOSS Yuan et al. ([2023](https://arxiv.org/html/2606.00014#bib.bib18)), which includes three classification tasks, Sentiment Analysis (SA), Toxicity Detection (TD), and Natural Language Inference (NLI), as well as one generation task, Named Entity Recognition (NER). All instruction templates follow the format provided in the original BOSS paper. We compare DOPA with the following baselines: Random Peng et al. ([2024](https://arxiv.org/html/2606.00014#bib.bib14)), KNN Liu et al. ([2022](https://arxiv.org/html/2606.00014#bib.bib7)), DrICL Luo et al. ([2023](https://arxiv.org/html/2606.00014#bib.bib27)), Rewrite Madine ([2024](https://arxiv.org/html/2606.00014#bib.bib13)), InfICL S. et al. ([2024](https://arxiv.org/html/2606.00014#bib.bib9)), and DICL Kapuriya et al. ([2025](https://arxiv.org/html/2606.00014#bib.bib38)), where Rewrite is a data augmentation-based method and the others are demonstration retrieval-based methods. To verify the adaptability of the proposed method, all methods are implemented on several LLMs: GPT2-xl, Qwen3-1.7B, Qwen3-8B, Gemma2-2B, LLaMA3.2-3B, and LLaMA3.1-8B. In addition, to investigate the performance of the proposed method on closed-source models, we also conduct experiments on GPT4o-mini and GPT3.5-turbo, and compare DOPA with KNN, InfICL, Rewrite, and DICL. Additional experimental settings can be found in Appendix [B](https://arxiv.org/html/2606.00014#A2).
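| Model | Method | SA: dynasent | SA: semeval | SA: sst | SA: avg | TD: implicit | TD: adv | TD: toxigen | TD: avg | NLI: wanli | NLI: anli | NLI: cnli | NLI: avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT4o-mini | KNN | 67.67 | 60.50 | 78.17 | 68.78 | 57.63 | 57.13 | 82.00 | 65.58 | 37.67 | 39.67 | 26.83 | 34.72 |
| GPT4o-mini | Rewrite | 64.33 | 56.17 | 75.83 | 65.44 | 54.63 | 54.75 | 72.25 | 60.54 | 36.83 | 37.67 | 27.67 | 34.06 |
| GPT4o-mini | InfICL | 63.83 | 54.17 | 80.67 | 66.22 | 59.38 | 62.88 | 84.15 | 68.79 | 38.38 | 38.00 | 32.83 | 36.39 |
| GPT4o-mini | DICL | 68.67 | 62.00 | 78.00 | 69.56 | 57.38 | 54.75 | 80.63 | 64.25 | 37.00 | 36.67 | 20.50 | 31.39 |
| GPT4o-mini | DOPA\* | 67.83 | 62.00 | 81.17 | 70.33 | 59.50 | 65.25 | 83.25 | 69.33 | 38.67 | 40.83 | 32.67 | 37.39 |
| GPT3.5-turbo | KNN | 67.17 | 60.00 | 78.83 | 68.67 | 58.38 | 60.50 | 82.23 | 67.04 | 38.00 | 40.00 | 30.83 | 36.28 |
| GPT3.5-turbo | Rewrite | 66.00 | 55.17 | 75.50 | 65.56 | 54.63 | 53.75 | 73.50 | 60.63 | 37.00 | 38.00 | 28.67 | 34.56 |
| GPT3.5-turbo | InfICL | 66.00 | 55.83 | 80.17 | 67.33 | 60.00 | 64.13 | 84.50 | 69.54 | 37.67 | 39.50 | 32.50 | 36.56 |
| GPT3.5-turbo | DICL | 61.83 | 43.50 | 62.33 | 55.89 | 38.50 | 39.62 | 40.63 | 39.58 | 10.67 | 13.67 | 8.17 | 10.83 |
| GPT3.5-turbo | DOPA\* | 68.00 | 61.17 | 80.50 | 69.89 | 58.50 | 66.13 | 82.63 | 69.08 | 38.17 | 39.33 | 32.00 | 36.50 |

Table 2: The performance (accuracy %) on classification tasks based on closed-source models.

Table 3: The performance on NER tasks.

Table 4: Ablation study results on LLaMA3.2-3B and Qwen3-1.7B.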
### 4.2 Experimental Results
We present the comparison results of DOPA with the aforementioned baseline methods on different LLMs in Table [1](https://arxiv.org/html/2606.00014#S3.T1), Table [2](https://arxiv.org/html/2606.00014#S4.T2), and Table [3](https://arxiv.org/html/2606.00014#S4.T3). We do not compare InfICL and Rewrite on the NER task because, for token-level tasks, the influence of individual samples is difficult to quantify, and sentence rewriting may change the original entities. As an alternative, we compare with KNN, DrICL, and DICL, which are not affected by the type of task.
For classification tasks (Table [1](https://arxiv.org/html/2606.00014#S3.T1)), DOPA exhibits noticeable performance degradation in only a few cases, demonstrating strong robustness under distribution shift. Wilcoxon signed-rank tests over nine evaluation tasks further confirm that DOPA significantly outperforms all baselines. By contrast, several recent methods fail to consistently surpass random selection in OOD settings. Notably, under LLaMA3.2-3B, Random sampling consistently outperforms similarity-based retrieval such as KNN, highlighting the persistent challenges posed by distribution shift and the instability of purely semantic retrieval. Methods leveraging additional signals show mixed effectiveness. DrICL, which incorporates LLM feedback to train a dense retriever, generally improves upon KNN but remains inferior to DOPA overall. The Rewrite strategy performs poorly, likely due to the absence of target-domain samples, which constrains the quality of rewritten demonstrations. InfICL achieves performance comparable to DOPA in a few settings (e.g., TD with Qwen3-1.7B and GPT3.5-turbo) but exhibits substantial instability, performing worst on NLI with GPT2-xl and LLaMA3.2-3B, and on SA with Gemma2-2B. Finally, the diversity-based method DICL fails to reliably alleviate distribution shift and can even degrade performance, as observed in SA with Qwen3-1.7B and NLI with GPT4o-mini.
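To make the significance test concrete, the following is a minimal sketch of an exact two-sided Wilcoxon signed-rank test on nine paired accuracies. It is an illustration rather than the authors' evaluation script: the inputs are the DOPA\* and KNN rows for GPT4o-mini from Table 2 (the exact pairing used for the reported tests is not reproduced here), and the helper names `average_ranks` and `wilcoxon_signed_rank` are hypothetical.

```python
from itertools import product

def average_ranks(values):
    """1-based ranks of |values|; tied magnitudes share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]))
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(values[order[j + 1]]) == abs(values[order[i]]):
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def wilcoxon_signed_rank(x, y):
    """Return (W+, W-, exact two-sided p-value) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    ranks = average_ranks(diffs)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    stat = min(w_plus, w_minus)
    # Exact null distribution: every rank is positive or negative with probability 1/2.
    hits = total = 0
    for signs in product((0, 1), repeat=len(ranks)):
        total += 1
        if sum(r for r, keep in zip(ranks, signs) if keep) <= stat:
            hits += 1
    return w_plus, w_minus, min(1.0, 2 * hits / total)

# Per-dataset accuracies from Table 2 (GPT4o-mini), in the order
# dynasent, semeval, sst, implicit, adv, toxigen, wanli, anli, cnli.
dopa = [67.83, 62.00, 81.17, 59.50, 65.25, 83.25, 38.67, 40.83, 32.67]
knn = [67.67, 60.50, 78.17, 57.63, 57.13, 82.00, 37.67, 39.67, 26.83]

w_plus, w_minus, p_value = wilcoxon_signed_rank(dopa, knn)
# All nine differences favour DOPA*, so W- = 0 and p = 2 / 2**9.
print(w_plus, w_minus, p_value)
```

With only nine pairs, the smallest attainable two-sided p-value is 2 / 512, which is why an exact enumeration is used here instead of a normal approximation.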
For generative NER tasks (Table [3](https://arxiv.org/html/2606.00014#S4.T3)), DOPA yields more pronounced performance gains, likely due to the higher difficulty of NER compared to classification tasks, which makes it more sensitive to the distribution of demonstration samples. We further observe that KNN-based retrieval benefits lightweight, locally deployable LLMs, as these models rely more heavily on external examples for guidance. In contrast, for larger models such as GPT4o-mini and GPT3.5-turbo, KNN retrieval can be detrimental. This is likely because their stronger reasoning capabilities and heightened sensitivity to distribution shifts make them more susceptible to misleading demonstrations that are semantically similar but distributionally mismatched. Overall, DOPA consistently improves performance while maintaining robustness across diverse tasks and models. Motivated by these results, we proceed to analyze the underlying mechanisms of DOPA.
### 4.3 Experimental Analysis
Figure 2: Performance influence of $k$ on LLaMA3.2-3B and Qwen3-1.7B across tasks.

Figure 3: Performance influence of $N$ on DOPA and KNN based on LLaMA3.2-3B and Qwen3-1.7B. The shaded areas with corresponding colors indicate the performance differences.

#### 4.3.1 Ablation Study
To assess the contribution of each component in DOPA, we conduct an ablation study comparing several variants (Table [4](https://arxiv.org/html/2606.00014#S4.T4)). Specifically, DOPA$_{-pro}$ removes the OOD proxy and relies solely on representation similarity; DOPA$_{-sim}$ discards semantic similarity and performs retrieval purely based on the OOD proxy; DOPA$_{-mah}$ removes the Mahalanobis distance-based diversity constraint; and DOPA$_{uni}$ replaces the LLM-based target proxy with a uniform distribution to empirically validate Lemma [1](https://arxiv.org/html/2606.00014#Thmlemma1). All variants result in performance degradation, with the smallest drop observed for DOPA$_{-mah}$, followed by DOPA$_{-pro}$ and DOPA$_{uni}$, and the largest drop for DOPA$_{-sim}$. These results highlight the complementary roles of DOPA's components: the OOD proxy enables coarse target-domain alignment, semantic similarity provides critical fine-grained filtering, and the diversity constraint promotes representative demonstrations. Notably, replacing the learned proxy with a uniform one leads to substantial degradation, supporting the effectiveness of the LLM-based proxy and the validity of Lemma [1](https://arxiv.org/html/2606.00014#Thmlemma1). Overall, proxy-based filtering and diversity-aware retrieval jointly contribute to improved demonstration quality and model performance.
#### 4.3.2 Exploration of $k$
We investigate the impact of varying $k \in \{300, 500, 800, 1000\}$ on demonstration selection and model performance (Figure [2](https://arxiv.org/html/2606.00014#S4.F2)). Results indicate that small $k$ values limit example diversity and hinder generalization, whereas larger $k$ values introduce noisy or redundant demonstrations that may degrade performance. Based on systematic evaluations across multiple tasks and datasets, we adopt $k = 800$ as a unified setting, as it provides a favorable balance between diversity and noise, despite not being optimal in all cases. Using a fixed $k$ also simplifies the selection process and ensures more consistent and comparable performance across tasks.
#### 4.3.3 Exploration of $N$
We study the effect of varying $N \in \{1, 2, 3, 4, 5\}$ on model performance (Figure [3](https://arxiv.org/html/2606.00014#S4.F3)), using KNN as a task-agnostic and stable baseline. Here, $N$ corresponds to $N \times |Y|$ demonstration samples. Performance generally improves with more demonstrations before gradually saturating Min et al. ([2022](https://arxiv.org/html/2606.00014#bib.bib3)), with the saturation point varying across models and tasks. For instance, in SA, Qwen3-1.7B peaks at $N = 4$, while LLaMA3.2-3B peaks at $N = 3$. Across all $N$ settings, DOPA consistently outperforms KNN by a substantial margin. For simplicity and consistency in the main experiments, we fix $N = 3$.
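The roles of $k$ and $N$ can be pictured with a small two-stage selection sketch. This is an illustration under stated assumptions, not the authors' implementation: it assumes that $k$ is the number of source candidates kept by the OOD proxy score before similarity ranking, and that $N$ demonstrations are then chosen per label, giving $N \times |Y|$ in total. The Mahalanobis diversity constraint is omitted, and the function `select_demonstrations` and the keys `id`, `vec`, `label`, and `proxy_score` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def select_demonstrations(pool, query_vec, k, n_per_label):
    """Two-stage selection over a list of sample dicts (hypothetical schema)."""
    # Stage 1 (assumed): keep the k candidates with the highest proxy score.
    candidates = sorted(pool, key=lambda s: s["proxy_score"], reverse=True)[:k]
    # Stage 2: within each label, keep the n_per_label candidates closest to the query.
    by_label = {}
    for sample in candidates:
        by_label.setdefault(sample["label"], []).append(sample)
    demos = []
    for label in sorted(by_label):
        ranked = sorted(by_label[label],
                        key=lambda s: cosine(s["vec"], query_vec),
                        reverse=True)
        demos.extend(ranked[:n_per_label])
    return demos

# Toy pool: 12 samples, 2 labels, made-up proxy scores and 2-d embeddings.
pool = [{"id": i, "label": i % 2, "proxy_score": 12 - i, "vec": [1.0, float(i)]}
        for i in range(12)]
demos = select_demonstrations(pool, query_vec=[1.0, 0.0], k=8, n_per_label=3)
print([d["id"] for d in demos])  # 3 demonstrations for each of the 2 labels
```

In this toy setting, shrinking `k` below the number needed to cover every label with `n_per_label` samples reduces the demonstration set, which mirrors the trade-off between candidate diversity and noise discussed above.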
#### 4.3.4 Visualization
Figure 4: Different visualization results on sst. (a) KDE visualization. (b) t-SNE visualization. (c) Euclidean distance comparison to target domain samples for retrieval results with and without the diversity constraint (with MahDist and w/o MahDist).

We validate the effectiveness of the proposed OOD proxy through visualization of proxy-based sample selection. Specifically, we characterize the behaviors of $\mathcal{D}_S$, $\hat{\mathcal{D}}_S$, and $\mathcal{D}_T$ by computing BERT-based energy scores Liu et al. ([2020](https://arxiv.org/html/2606.00014#bib.bib32)) and estimating their distributions via kernel density estimation (KDE) Węglarczyk ([2018](https://arxiv.org/html/2606.00014#bib.bib34)). We adopt energy-score distributions rather than the commonly used t-SNE visualizations, as representational proximity alone does not adequately reflect OOD tendencies. As shown in Figure [4](https://arxiv.org/html/2606.00014#S4.F4), the proxy-selected samples exhibit an energy-score distribution that more closely matches and overlaps with that of the target domain (Figure [4(a)](https://arxiv.org/html/2606.00014#S4.F4.sf1)), indicating effective source-to-target alignment. In contrast, t-SNE visualizations (Figure [4(b)](https://arxiv.org/html/2606.00014#S4.F4.sf2)) show that proxy-selected samples remain closer to the source domain in representation space, suggesting residual semantic distance from the target domain. This discrepancy further explains the results in Table [1](https://arxiv.org/html/2606.00014#S3.T1), demonstrating that semantic similarity-based retrieval (e.g., KNN) is insufficient for OOD adaptation, whereas DOPA effectively addresses the limitations of purely representation-based methods.
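As a rough sketch of how such a comparison could be computed, the code below evaluates the energy score of Liu et al. (2020), $E(x) = -T \log \sum_i \exp(f_i(x)/T)$, from classifier logits and fits a Gaussian KDE to the resulting scores. The logits, the bandwidth of 0.5, and the helper names `energy_score` and `gaussian_kde` are assumptions made for illustration; the paper's BERT classifier and KDE settings are not reproduced here.

```python
import math

def energy_score(logits, temperature=1.0):
    """E(x) = -T * log(sum_i exp(f_i(x) / T)), using a stable log-sum-exp."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return -temperature * log_sum_exp

def gaussian_kde(samples, bandwidth):
    """Return a density function: the average of Gaussian kernels on samples."""
    scale = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return scale * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                           for s in samples)
    return density

# Made-up logits: confident predictions (source-like) vs. flat ones (target-like).
source_logits = [[4.0, 0.5], [3.5, 0.2], [0.3, 4.2]]
target_logits = [[1.2, 1.0], [0.8, 1.1], [1.5, 1.4]]

source_energy = [energy_score(z) for z in source_logits]
target_energy = [energy_score(z) for z in target_logits]

source_density = gaussian_kde(source_energy, bandwidth=0.5)
target_density = gaussian_kde(target_energy, bandwidth=0.5)

# Evaluate both curves on a shared grid so they can be plotted or compared.
grid = [-6.0 + 0.1 * i for i in range(81)]
source_curve = [source_density(x) for x in grid]
target_curve = [target_density(x) for x in grid]
```

Confident logits yield lower (more negative) energy than flat ones, so the overlap of two KDE curves on the same grid gives a simple view of how closely two sample sets behave under the classifier.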
Additionally, to assess the effectiveness of the diversity constraint, we compute the Euclidean distances between retrieved demonstrations and their corresponding test samples for the first 1,000 test instances, under both with- and without-MahDist settings. Figure [4(c)](https://arxiv.org/html/2606.00014#S4.F4.sf3) presents the corresponding fitted distance distributions. The curve without MahDist consistently lies below that with MahDist, indicating that the diversity constraint encourages more varied retrieval. Importantly, the two curves remain close, suggesting that this increased diversity is controlled and does not introduce excessive semantic drift. In summary, the visualization results provide evidence for the effectiveness of DOPA from two perspectives: it helps retrieve demonstrations that exhibit similar behavior to target domain samples while maintaining high diversity, thereby enhancing the performance of ICL. We also observe similar trends across the remaining datasets. Additional visualization results are in Appendix [F](https://arxiv.org/html/2606.00014#A6).
## 5 Conclusion
This paper shows that OOD proxies can effectively retrieve source samples aligned with the target domain under distribution shift\. Based on this insight, we propose DOPA, a target\-free framework enhanced with a diversity constraint to mitigate proxy bias\. Experiments across multiple LLMs confirm its effectiveness\. Future work will focus on more robust proxy estimation for unknown target domains\.
## Limitations
Since target\-domain data is unavailable, the constructed target\-domain proxies may not fully capture the true distributional characteristics of the target dataset, inevitably introducing approximation errors\. Consequently, developing more accurate and robust target\-domain proxy construction methods remains an important direction for future work\. In addition, evaluating DOPA across a broader range of LLM families would further clarify its generality and adaptability\.
## Acknowledgements
This work is supported by the National Natural Science Foundation of China \(NSFC\): “Research on Understanding Ancient Characters Based on Multi\-modal Large Models” \(Grant No\. 62476111\), China Postdoctoral Science Foundation Funded Project \(Grant No\. 2024M761122\), the Scientific Research Project of the Education Department of Jilin Province \(Grant No\. JJKH20261299KJ\) and the Industry University Research Innovation Fund of the Ministry of Education project "Research and Application of an Integrated Teaching Model for Human centered Artificial Intelligence" \(Grant No\. 2022XF017\)\.
## References
- S\. Agrawal, C\. Zhou, M\. Lewis, L\. Zettlemoyer, and M\. Ghazvininejad \(2023\)In\-context examples selection for machine translation\.InFindings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9\-14, 2023,pp\. 8857–8873\.Cited by:[§1](https://arxiv.org/html/2606.00014#S1.p2.1),[§2](https://arxiv.org/html/2606.00014#S2.p1.1),[§3\.3](https://arxiv.org/html/2606.00014#S3.SS3.p1.4)\.
- H\. Bai, X\. Du, K\. Rainey, S\. Parameswaran, and Y\. Li \(2024\)Out\-of\-distribution learning with human feedback\.CoRRabs/2408\.07772\.Cited by:[§2](https://arxiv.org/html/2606.00014#S2.p2.1)\.
- C\. M\. Bishop \(1993\)Novelty detection and neural network validation\.InICANN’93: Proceedings of the International Conference on Artificial Neural Networks Amsterdam, The Netherlands 13–16 September 1993 3,pp\. 789–794\.Cited by:[§3\.2](https://arxiv.org/html/2606.00014#S3.SS2.p3.1)\.
- Y\. Chang, X\. Wang, J\. Wang, Y\. Wu, L\. Yang, K\. Zhu, H\. Chen, X\. Yi, C\. Wang, Y\. Wang, W\. Ye, Y\. Zhang, Y\. Chang, P\. S\. Yu, Q\. Yang, and X\. Xie \(2024\)A survey on evaluation of large language models\.ACM Trans\. Intell\. Syst\. Technol\.15\(3\),pp\. 39:1–39:45\.Cited by:[§1](https://arxiv.org/html/2606.00014#S1.p1.1)\.
- D\. Cheng, S\. Huang, J\. Bi, Y\. Zhan, J\. Liu, Y\. Wang, H\. Sun, F\. Wei, W\. Deng, and Q\. Zhang \(2023\)UPRISE: universal prompt retrieval for improving zero\-shot evaluation\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6\-10, 2023,pp\. 12318–12337\.Cited by:[§1](https://arxiv.org/html/2606.00014#S1.p2.1),[§2](https://arxiv.org/html/2606.00014#S2.p1.1)\.
- J\. Demsar \(2006\)Statistical comparisons of classifiers over multiple data sets\.J\. Mach\. Learn\. Res\.7,pp\. 1–30\.Cited by:[§B\.3](https://arxiv.org/html/2606.00014#A2.SS3.p2.1)\.
- T\. Gao, X\. Yao, and D\. Chen \(2021\)SimCSE: simple contrastive learning of sentence embeddings\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7\-11 November, 2021,pp\. 6894–6910\.Cited by:[§2](https://arxiv.org/html/2606.00014#S2.p1.1)\.
- A\. Holtzman, J\. Buys, L\. Du, M\. Forbes, and Y\. Choi \(2020\)The curious case of neural text degeneration\.In8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26\-30, 2020,Cited by:[§3\.3](https://arxiv.org/html/2606.00014#S3.SS3.p1.4)\.
- S\. Jiang, Q\. Chen, Y\. Xiang, Y\. Pan, and Y\. Lin \(2024\)Linguistic rule induction improves adversarial and OOD robustness in large language models\.InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20\-25 May, 2024, Torino, Italy,pp\. 10565–10577\.Cited by:[§2](https://arxiv.org/html/2606.00014#S2.p2.1)\.
- J\. Kapuriya, M\. Kaushik, D\. Ganguly, and S\. Bhatia \(2025\)Exploring the role of diversity in example selection for in\-context learning\.InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2025, Padua, Italy, July 13\-18, 2025,pp\. 2962–2966\.Cited by:[6th item](https://arxiv.org/html/2606.00014#A2.I1.i6.p1.1),[§4\.1](https://arxiv.org/html/2606.00014#S4.SS1.p1.1)\.
- X\. Li, K\. Lv, H\. Yan, T\. Lin, W\. Zhu, Y\. Ni, G\. Xie, X\. Wang, and X\. Qiu \(2023a\)Unified demonstration retriever for in\-context learning\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2023, Toronto, Canada, July 9\-14, 2023,pp\. 4644–4668\.Cited by:[§1](https://arxiv.org/html/2606.00014#S1.p2.1),[§2](https://arxiv.org/html/2606.00014#S2.p1.1)\.
- Y\. Li, M\. Du, X\. Wang, and Y\. Wang \(2023b\)Prompt tuning pushes farther, contrastive learning pulls closer: A two\-stage approach to mitigate social biases\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2023, Toronto, Canada, July 9\-14, 2023,pp\. 14254–14267\.Cited by:[§3\.3](https://arxiv.org/html/2606.00014#S3.SS3.p1.6)\.
- J\. Liu, D\. Shen, Y\. Zhang, B\. Dolan, L\. Carin, and W\. Chen \(2022\)What makes good in\-context examples for gpt\-3?\.InProceedings of Deep Learning Inside Out: The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, DeeLIO@ACL 2022, Dublin, Ireland and Online, May 27, 2022,pp\. 100–114\.Cited by:[2nd item](https://arxiv.org/html/2606.00014#A2.I1.i2.p1.1),[§1](https://arxiv.org/html/2606.00014#S1.p2.1),[§2](https://arxiv.org/html/2606.00014#S2.p1.1),[§3\.3](https://arxiv.org/html/2606.00014#S3.SS3.p1.4),[§4\.1](https://arxiv.org/html/2606.00014#S4.SS1.p1.1)\.
- W\. Liu, X\. Wang, J\. D\. Owens, and Y\. Li \(2020\)Energy\-based out\-of\-distribution detection\.InAdvances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6\-12, 2020, virtual,Cited by:[§4\.3\.4](https://arxiv.org/html/2606.00014#S4.SS3.SSS4.p1.3)\.
- M\. Luo, X\. Xu, Z\. Dai, P\. Pasupat, S\. M\. Kazemi, C\. Baral, V\. Imbrasaite, and V\. Y\. Zhao \(2023\)Dr\.icl: demonstration\-retrieved in\-context learning\.CoRRabs/2305\.14128\.Cited by:[3rd item](https://arxiv.org/html/2606.00014#A2.I1.i3.p1.1),[§2](https://arxiv.org/html/2606.00014#S2.p1.1),[§4\.1](https://arxiv.org/html/2606.00014#S4.SS1.p1.1)\.
- M\. Luo, X\. Xu, Y\. Liu, P\. Pasupat, and M\. Kazemi \(2024\)In\-context learning with retrieved demonstrations for language models: A survey\.CoRRabs/2401\.11624\.Cited by:[§1](https://arxiv.org/html/2606.00014#S1.p2.1),[§2](https://arxiv.org/html/2606.00014#S2.p1.1)\.
- M\. Madine \(2024\)Bridging distribution gap via semantic rewriting with llms to enhance OOD robustness\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 \- Student Research Workshop, Bangkok, Thailand, August 11\-16, 2024,pp\. 458–468\.Cited by:[4th item](https://arxiv.org/html/2606.00014#A2.I1.i4.p1.1),[§1](https://arxiv.org/html/2606.00014#S1.p2.1),[§2](https://arxiv.org/html/2606.00014#S2.p2.1),[§4\.1](https://arxiv.org/html/2606.00014#S4.SS1.p1.1)\.
- S\. Min, X\. Lyu, A\. Holtzman, M\. Artetxe, M\. Lewis, H\. Hajishirzi, and L\. Zettlemoyer \(2022\)Rethinking the role of demonstrations: what makes in\-context learning work?\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7\-11, 2022,pp\. 11048–11064\.Cited by:[§1](https://arxiv.org/html/2606.00014#S1.p1.1),[§4\.3\.3](https://arxiv.org/html/2606.00014#S4.SS3.SSS3.p1.7)\.
- E\. T\. Nalisnick, A\. Matsukawa, Y\. W\. Teh, D\. Görür, and B\. Lakshminarayanan \(2019\)Do deep generative models know what they don’t know?\.In7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6\-9, 2019,Cited by:[§1](https://arxiv.org/html/2606.00014#S1.p3.1),[§3\.2](https://arxiv.org/html/2606.00014#S3.SS2.p3.1)\.
- J\. Ni, C\. Qu, J\. Lu, Z\. Dai, G\. H\. Abrego, J\. Ma, V\. Zhao, Y\. Luan, K\. Hall, M\. Chang,et al\.\(2022\)Large dual encoders are generalizable retrievers\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,pp\. 9844–9855\.Cited by:[3rd item](https://arxiv.org/html/2606.00014#A2.I1.i3.p1.1)\.
- K\. O’Brien, N\. Ng, I\. Puri, J\. Mendez, H\. Palangi, Y\. Kim, M\. Ghassemi, and T\. Hartvigsen \(2024\)Improving black\-box robustness with in\-context rewriting\.CoRRabs/2402\.08225\.Cited by:[§1](https://arxiv.org/html/2606.00014#S1.p2.1),[§2](https://arxiv.org/html/2606.00014#S2.p2.1)\.
- K\. Peng, L\. Ding, Y\. Yuan, X\. Liu, M\. Zhang, Y\. Ouyang, and D\. Tao \(2024\)Revisiting demonstration selection strategies in in\-context learning\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2024, Bangkok, Thailand, August 11\-16, 2024,pp\. 9090–9101\.Cited by:[1st item](https://arxiv.org/html/2606.00014#A2.I1.i1.p1.1),[§1](https://arxiv.org/html/2606.00014#S1.p2.1),[§2](https://arxiv.org/html/2606.00014#S2.p1.1),[§4\.1](https://arxiv.org/html/2606.00014#S4.SS1.p1.1)\.
- J\. Ren, P\. J\. Liu, E\. Fertig, J\. Snoek, R\. Poplin, M\. A\. DePristo, J\. V\. Dillon, and B\. Lakshminarayanan \(2019\)Likelihood ratios for out\-of\-distribution detection\.InAdvances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8\-14, 2019, Vancouver, BC, Canada,pp\. 14680–14691\.Cited by:[§3\.2](https://arxiv.org/html/2606.00014#S3.SS2.p1.1),[§3\.2](https://arxiv.org/html/2606.00014#S3.SS2.p4.4)\.
- V\. M\. S\., M\. Van, and X\. Wu \(2024\)In\-context learning demonstration selection via influence analysis\.CoRRabs/2402\.11750\.Cited by:[5th item](https://arxiv.org/html/2606.00014#A2.I1.i5.p1.1),[§1](https://arxiv.org/html/2606.00014#S1.p2.1),[§2](https://arxiv.org/html/2606.00014#S2.p1.1),[§4\.1](https://arxiv.org/html/2606.00014#S4.SS1.p1.1)\.
- L\. Shu, L\. Luo, J\. Hoskere, Y\. Zhu, Y\. Liu, S\. Tong, J\. Chen, and L\. Meng \(2024\)RewriteLM: an instruction\-tuned large language model for text rewriting\.InThirty\-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty\-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2014, February 20\-27, 2024, Vancouver, Canada,pp\. 18970–18980\.Cited by:[§1](https://arxiv.org/html/2606.00014#S1.p2.1)\.
- R\. Song, F\. Giunchiglia, Y\. Li, M\. Tian, and H\. Xu \(2024a\)TACIT: A target\-agnostic feature disentanglement framework for cross\-domain text classification\.InThirty\-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024,pp\. 18999–19007\.Cited by:[§1](https://arxiv.org/html/2606.00014#S1.p2.1.1.1)\.
- R\. Song, Y\. Li, L\. Shi, F\. Giunchiglia, and H\. Xu \(2024b\)Shortcut learning in in\-context learning: A survey\.CoRRabs/2411\.02018\.Cited by:[§2](https://arxiv.org/html/2606.00014#S2.p1.1)\.
- R\. Song, Y\. Li, M\. Tian, H\. Wang, F\. Giunchiglia, and H\. Xu \(2025\)Causal keyword driven reliable text classification with large language model feedback\.Inf\. Process\. Manag\.62\(3\),pp\. 103964\.Cited by:[§1](https://arxiv.org/html/2606.00014#S1.p1.1)\.
- R\. Uppaal, J\. Hu, and Y\. Li \(2023\)Is fine\-tuning needed? pre\-trained language models are near perfect for out\-of\-domain detection\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2023, Toronto, Canada, July 9\-14, 2023,pp\. 12813–12832\.Cited by:[§2](https://arxiv.org/html/2606.00014#S2.p2.1)\.
- L\. Wang, N\. Yang, and F\. Wei \(2024\)Learning to retrieve in\-context examples for large language models\.InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2024 \- Volume 1: Long Papers, St\. Julian’s, Malta, March 17\-22, 2024,pp\. 1752–1767\.Cited by:[§2](https://arxiv.org/html/2606.00014#S2.p1.1)\.
- Q\. Wang, Y\. Wang, X\. Ying, and Y\. Wang \(2025\)Can in\-context learning really generalize to out\-of\-distribution tasks?\.InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025,Cited by:[§1](https://arxiv.org/html/2606.00014#S1.p1.1),[§2](https://arxiv.org/html/2606.00014#S2.p2.1)\.
- S\. Węglarczyk \(2018\)Kernel density estimation and its application\.InITM web of conferences,Vol\.23,pp\. 00037\.Cited by:[§4\.3\.4](https://arxiv.org/html/2606.00014#S4.SS3.SSS4.p1.3)\.
- A\. Wuhrmann, A\. Kucharavy, and A\. Kucherenko \(2025\)Low\-perplexity llm\-generated sequences and where to find them\.InACL 2025 Student Research Workshop,Cited by:[§3\.2](https://arxiv.org/html/2606.00014#S3.SS2.p5.7)\.
- S\. Xu and C\. Zhang \(2024\)Misconfidence\-based demonstration selection for LLM in\-context learning\.CoRRabs/2401\.06301\.Cited by:[§1](https://arxiv.org/html/2606.00014#S1.p2.1)\.
- J\. Ye, Z\. Wu, J\. Feng, T\. Yu, and L\. Kong \(2023\)Compositional exemplars for in\-context learning\.InInternational Conference on Machine Learning, ICML 2023, 23\-29 July 2023, Honolulu, Hawaii, USA,Proceedings of Machine Learning Research, Vol\.202,pp\. 39818–39833\.Cited by:[§2](https://arxiv.org/html/2606.00014#S2.p1.1)\.
- L\. Yuan, Y\. Chen, G\. Cui, H\. Gao, F\. Zou, X\. Cheng, H\. Ji, Z\. Liu, and M\. Sun \(2023\)Revisiting out\-of\-distribution robustness in NLP: benchmarks, analysis, and llms evaluations\.InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023,Cited by:[§B\.1](https://arxiv.org/html/2606.00014#A2.SS1.p1.1),[§1](https://arxiv.org/html/2606.00014#S1.p1.1),[§2](https://arxiv.org/html/2606.00014#S2.p2.1),[§4\.1](https://arxiv.org/html/2606.00014#S4.SS1.p1.1)\.
- A\. Zhang and D\. Wischik \(2022\)Falsehoods that ML researchers believe about OOD detection\.InNeurIPS ML Safety Workshop,Cited by:[§1](https://arxiv.org/html/2606.00014#S1.p3.1),[§3\.2](https://arxiv.org/html/2606.00014#S3.SS2.p1.1),[§3\.2](https://arxiv.org/html/2606.00014#S3.SS2.p4.4)\.
- A\. Zhang, T\. Z\. Xiao, W\. Liu, R\. Bamler, and D\. Wischik \(2024\)Your finetuned large language model is already a powerful out\-of\-distribution detector\.CoRRabs/2404\.08679\.Cited by:[§2](https://arxiv.org/html/2606.00014#S2.p2.1),[§3\.2](https://arxiv.org/html/2606.00014#S3.SS2.p3.1)\.
## Appendix A Theoretical Analysis and Proof
The following provides a detailed proof of the boundedness of proxy errors\.
###### Proof\.
We use shorthand notation: let $P_t := P_{\mathrm{target}}$, $P_s := P_{\mathrm{source}}$, $P_t^p := P_{\mathrm{target}}^{\mathrm{proxy}}$, and $P_s^p := P_{\mathrm{source}}^{\mathrm{proxy}}$.

We aim to bound the log-likelihood ratio error:

$$\Delta(x) := \left|\log\frac{P_t(x)}{P_s(x)} - \log\frac{P_t^p(x)}{P_s^p(x)}\right|$$

Applying the triangle inequality:

$$\Delta(x) = \left|\log\frac{P_t(x)}{P_t^p(x)} - \log\frac{P_s(x)}{P_s^p(x)}\right| \leq \left|\log\frac{P_t(x)}{P_t^p(x)}\right| + \left|\log\frac{P_s(x)}{P_s^p(x)}\right|$$

We now upper bound each term. Then, from the definition of KL divergence:

$$D_{\mathrm{KL}}(P_t \,\Vert\, P_t^p) = \sum_x P_t(x)\log\frac{P_t(x)}{P_t^p(x)} \leq \varepsilon_t$$

Now, suppose for some $x$ we have $P_t(x) \geq m_t$ and

$$\left|\log\frac{P_t(x)}{P_t^p(x)}\right| > \frac{\varepsilon_t}{m_t}$$

Then,

$$P_t(x) \cdot \left|\log\frac{P_t(x)}{P_t^p(x)}\right| > m_t \cdot \frac{\varepsilon_t}{m_t} = \varepsilon_t$$

This contradicts the assumption $D_{\mathrm{KL}}(P_t \,\Vert\, P_t^p) \leq \varepsilon_t$. Therefore, for all $x$:

$$\left|\log\frac{P_t(x)}{P_t^p(x)}\right| \leq \frac{\varepsilon_t}{m_t}$$

Analogously, we obtain:

$$\left|\log\frac{P_s(x)}{P_s^p(x)}\right| \leq \frac{\varepsilon_s}{m_s}$$

Combining the two bounds:

$$\Delta(x) \leq \frac{\varepsilon_t}{m_t} + \frac{\varepsilon_s}{m_s}$$

The proof of the theorem is complete. ∎
The theorem shows that if the KL\-divergence between the true distribution and its proxy is sufficiently small, and the probability mass at each point is lower bounded, then the deviation in log\-probability ratios is controllable in expectation\. Therefore, a properly constructed proxy distribution yields bounded error in tasks such as density ratio estimation or scoring, which verifies the effectiveness and reliability of using proxies\.
Moreover, some methods propose a general approach by replacing the target\-domain proxy with a uniform distribution\. However, this strong assumption may lead to suboptimal solutions\. Accordingly, we introduce Lemma[1](https://arxiv.org/html/2606.00014#Thmlemma1)to illustrate the limitations of using a uniform distribution\.
###### Proof of Lemma[1](https://arxiv.org/html/2606.00014#Thmlemma1)\.
We consider the case where the proxy distribution for the target domain is chosen as the uniform distribution over the support $\mathcal{X}$:

$$P_t^p(x) = \frac{1}{|\mathcal{X}|} \quad \text{for all } x \in \mathcal{X}$$

From Theorem [1](https://arxiv.org/html/2606.00014#Thmtheorem1), the error in the log-likelihood ratio satisfies:

$$\left|\log\frac{P_t(x)}{P_s(x)} - \log\frac{P_t^p(x)}{P_s^p(x)}\right| \leq \left|\log\frac{P_t(x)}{P_t^p(x)}\right| + \left|\log\frac{P_s(x)}{P_s^p(x)}\right|$$

We now focus on bounding the first term with the uniform proxy:

$$A_t(x) := \left|\log\frac{P_t(x)}{P_t^p(x)}\right| = \left|\log\left(P_t(x) \cdot |\mathcal{X}|\right)\right| = \left|\log P_t(x) + \log|\mathcal{X}|\right|$$

From the definition of KL divergence between $P_t$ and uniform distribution $U$:

$$\begin{aligned} D_{\mathrm{KL}}(P_t \,\Vert\, U) &= \sum_x P_t(x)\log\frac{P_t(x)}{1/|\mathcal{X}|} \\ &= \sum_x P_t(x)\left[\log P_t(x) + \log|\mathcal{X}|\right] \\ &= \log|\mathcal{X}| - H(P_t) \end{aligned}$$

where $H(P_t) := -\sum_x P_t(x)\log P_t(x)$ is the Shannon entropy of $P_t$.

Now suppose that $P_t(x) \geq m_t > 0$ for all $x$. Following the same logic as in the proof of Theorem [1](https://arxiv.org/html/2606.00014#Thmtheorem1), we know that if:

$$\left|\log\frac{P_t(x)}{P_t^p(x)}\right| > \frac{D_{\mathrm{KL}}(P_t \,\Vert\, U)}{m_t}$$

Then this point would contribute more than $D_{\mathrm{KL}}(P_t \,\Vert\, U)$ to the KL divergence, leading to a contradiction. Therefore, for all $x$:

$$\left|\log\frac{P_t(x)}{P_t^p(x)}\right| \leq \frac{D_{\mathrm{KL}}(P_t \,\Vert\, U)}{m_t} = \frac{\log|\mathcal{X}| - H(P_t)}{m_t}$$

Substituting into the total bound in Theorem [1](https://arxiv.org/html/2606.00014#Thmtheorem1), we obtain:

$$\left|\log\frac{P_t(x)}{P_s(x)} - \log\frac{P_t^p(x)}{P_s^p(x)}\right| \leq \frac{\log|\mathcal{X}| - H(P_t)}{m_t} + \frac{\varepsilon_s}{m_s}$$

This upper bound is typically looser than the one obtained when $P_t^p$ approximates $P_t$ well (i.e., KL divergence is small), since $\log|\mathcal{X}| - H(P_t)$ can be large when $P_t$ is sharply peaked.
∎
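The identity $D_{\mathrm{KL}}(P_t \,\Vert\, U) = \log|\mathcal{X}| - H(P_t)$ used in the proof, and the remark that this quantity grows when $P_t$ is sharply peaked, can be checked numerically. The sketch below uses two made-up distributions over four outcomes; the helper names `entropy` and `kl_to_uniform` are hypothetical and natural logarithms are assumed.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_to_uniform(p):
    """KL(p || U) where U is uniform over the same support as p."""
    n = len(p)
    return sum(pi * math.log(pi * n) for pi in p if pi > 0)

flat = [0.25, 0.25, 0.25, 0.25]      # already uniform
peaked = [0.97, 0.01, 0.01, 0.01]    # sharply peaked

gaps = {}
for name, p in (("flat", flat), ("peaked", peaked)):
    gap = math.log(len(p)) - entropy(p)
    # Both expressions agree up to floating-point error.
    assert abs(gap - kl_to_uniform(p)) < 1e-12
    gaps[name] = gap

# The flat distribution gives a gap of approximately 0,
# while the peaked one gives roughly 1.22 nats.
print(gaps)
```

Since the gap appears in the numerator of the uniform-proxy bound, a peaked target distribution makes that term large, which is consistent with the degradation observed for the uniform-proxy variant in the ablation study.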
Figure 5: Different KDE visualization results on all classification tasks. Panels: (a) dynasent, (b) semeval, (c) adv_civil, (d) implicit_hate, (e) toxigen, (f) anli, (g) contract_nli, (h) wanli.
## Appendix B Experimental Details
### B.1 Dataset Details
We focus on four core NLP tasks from BOSS Yuan et al. ([2023](https://arxiv.org/html/2606.00014#bib.bib18)), a benchmark suite specifically designed to evaluate the robustness of language models under OOD scenarios: Sentiment Analysis (SA), Toxicity Detection (TD), Natural Language Inference (NLI), and Named Entity Recognition (NER). To balance the number of samples, we randomly select 3,000 training samples per class from the original in-distribution dataset for SA and NLI, and 5,000 training samples per class for TD. For testing, we randomly sample up to 1,000 instances per class from the target domain for SA and NLI, and 1,500 test samples per class for TD. For the NER task, we select 10,000 samples from the source dataset that contain only "Location", "Organization", or "Person" entities to unify the label space, and use all eligible samples from the target domain for testing. We do not use the conll dataset because it contains a large number of annotation errors, which could make model evaluation unreliable.
### B.2 Baseline Details
We provide a detailed introduction of the baseline methods used in this section\.
- Random (Peng et al. [2024](https://arxiv.org/html/2606.00014#bib.bib14)): We randomly select the required number of samples from the source domain to construct demonstrations. To reduce the performance variance caused by randomness, we repeat this process five times and report the average results.
- KNN (Liu et al. [2022](https://arxiv.org/html/2606.00014#bib.bib7)): We use the SimCSE representations of samples as the retrieval basis and construct demonstrations by selecting the samples nearest to the test sample in the representation space.
- DrICL (Luo et al. [2023](https://arxiv.org/html/2606.00014#bib.bib27)): We first use KNN to select the top 30 candidate samples that are most similar to the test sample. These candidates are then ranked by their individual contributions to the LLM's actual predictions (for LLMs that cannot be deployed locally, we use LLaMA3.2-3B instead). The top 10 are treated as positive examples and the bottom 10 as negative examples to train a dual-encoder neural retriever, GTR (Ni et al. [2022](https://arxiv.org/html/2606.00014#bib.bib33)), which is subsequently used for demonstration retrieval.
- Rewrite (Madine [2024](https://arxiv.org/html/2606.00014#bib.bib13)): We perform KNN-based demonstration retrieval and rewrite the retrieved samples according to the style of the test sample, so that the demonstrations better align with the target domain. In contrast to the original method, we adapt the rewriting strategy to a strict target-unavailable setting, where only a single test instance is exposed at a time rather than a set of target samples.
- InfICL (S. et al. [2024](https://arxiv.org/html/2606.00014#bib.bib9)): It estimates the influence of each candidate demonstration on the model's prediction for a given test input and selects the demonstrations with the most beneficial effect. By leveraging gradient-based influence approximations, the method identifies which demonstrations most positively affect the model's output distribution without requiring extensive evaluation over all combinations.
- DICL (Kapuriya et al. [2025](https://arxiv.org/html/2606.00014#bib.bib38)): It employs Maximum Marginal Relevance (MMR), which jointly considers the relevance between the input and the samples as well as the mutual dissimilarity among the samples. This allows the selected context to remain highly relevant while being more diverse.
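To make the MMR idea behind the DICL baseline concrete, the sketch below shows a generic greedy MMR selection over embedding vectors. It is not the DICL implementation: the cosine similarity, the trade-off weight `lam`, and the toy 2-D vectors are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(query, candidates, k, lam=0.5):
    """Greedy Maximum Marginal Relevance; returns indices of k candidates.

    score(i) = lam * sim(query, c_i)
               - (1 - lam) * max over selected j of sim(c_i, c_j)
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def unit(deg):
    """Unit vector at the given angle (toy 2-D embeddings)."""
    rad = math.radians(deg)
    return [math.cos(rad), math.sin(rad)]


query = unit(45)
candidates = [unit(40), unit(35), unit(70)]
relevance_only = mmr_select(query, candidates, k=2, lam=1.0)  # [0, 1]
with_diversity = mmr_select(query, candidates, k=2, lam=0.5)  # [0, 2]
```

With `lam=1.0` the selection reduces to plain nearest-neighbour retrieval. Lowering `lam` penalizes the near-duplicate candidate at 35 degrees, so the more distinct candidate at 70 degrees is chosen second.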
### B.3 More Experimental Settings
To prevent potential bias caused by an imbalanced number of samples per label in the demonstrations, we retrieve the same number of samples $N$ for each label. For classification tasks, the total number of demonstrations is therefore $N \times |Y|$, where $|Y|$ is the number of labels. For generative tasks, which do not involve specific class labels, we directly set the number of demonstrations to $N$. For classification tasks, we set the number of demonstrations $C$ in the initial demonstration set to $|Y|$, while for generative tasks we set $C$ to 1. For instruction fine-tuning, we convert the source domain data into training samples following the instruction format of BOSS. During training, we apply LoRA with a learning rate of 1e-5 for one epoch. For GPT4o-mini and GPT3.5-turbo, we use the APIs provided by xi-ai (https://api.xi-ai.cn/). For experiments involving closed-source LLMs (e.g., GPT-based APIs), we use Llama 3.2-3B as the proxy model for both the source and target domains. While this proxy cannot perfectly replicate the behavior of the closed-source model, it provides a practical and consistent reference when internal representations and training access are unavailable.
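The label-balanced setting, in which $N$ demonstrations are retrieved per label for a total of $N \times |Y|$, can be illustrated with a small sketch. It uses plain cosine-similarity ranking as a stand-in for the retrieval score; the function `retrieve_balanced`, the `(vector, label, text)` pool format, and the toy vectors are assumptions and do not reproduce DOPA's proxy-based scoring.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_balanced(query_vec, pool, n_per_label):
    """Return the top `n_per_label` texts for every label in the pool.

    `pool` is a list of (vector, label, text) triples. Similarity ranking
    is only a placeholder for whatever retrieval score is actually used.
    """
    by_label = defaultdict(list)
    for vec, label, text in pool:
        by_label[label].append((cosine(query_vec, vec), text))

    demos = {}
    for label, scored in by_label.items():
        scored.sort(key=lambda pair: pair[0], reverse=True)
        demos[label] = [text for _, text in scored[:n_per_label]]
    return demos


pool = [
    ([1.0, 0.0], "positive", "p1"),
    ([0.8, 0.6], "positive", "p2"),
    ([0.0, 1.0], "positive", "p3"),
    ([0.9, 0.1], "negative", "n1"),
    ([0.1, 0.9], "negative", "n2"),
    ([0.7, 0.7], "negative", "n3"),
]
demos = retrieve_balanced([1.0, 0.0], pool, n_per_label=2)
total = sum(len(texts) for texts in demos.values())  # N * |Y| = 2 * 2
```

Grouping by label before ranking guarantees that each class contributes exactly $N$ demonstrations whenever the pool holds at least $N$ samples of that class.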
To compare the performance of DOPA and the baselines across multiple datasets, we employ the Wilcoxon signed-rank test, which is widely used for model comparison across multiple benchmarks Demsar ([2006](https://arxiv.org/html/2606.00014#bib.bib37)). This non-parametric statistical test is designed for paired samples and does not assume normality of the underlying distribution. In our setting, the paired observations are the performance scores of the two models (DOPA and a given baseline) on the same datasets. If DOPA shows statistically significant improvements ($p \leq 0.05$) over all baselines, we denote it as DOPA\*.
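As an illustration of how such a paired comparison works, the following sketch implements an exact two-sided Wilcoxon signed-rank test in pure Python. It is a teaching example under stated assumptions: the scores are made up rather than taken from the paper, zero differences are dropped, and the exact null distribution is enumerated, which is only practical for a small number of datasets.

```python
from itertools import product


def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired scores.

    Returns (W_plus, p_value). Zero differences are dropped and tied
    absolute differences receive average ranks.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks within ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Under the null hypothesis each rank is positive with probability 1/2,
    # so enumerate all 2^n sign assignments (fine for small n only).
    mean = sum(ranks) / 2
    observed = abs(w_plus - mean)
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - mean) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Hypothetical per-dataset scores, not results from the paper.
method_scores = [71, 68, 80, 75, 66, 90, 73, 82]
baseline_scores = [70, 66, 77, 71, 61, 84, 66, 74]
w_plus, p_value = wilcoxon_signed_rank(method_scores, baseline_scores)
significant = p_value <= 0.05
```

In this toy case the first method wins on all eight datasets, so only the two most extreme sign patterns are at least as extreme as the observation and the p-value is 2/256. For larger benchmark suites, a library routine with a normal approximation would be the more usual choice.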
Model Parameters Download: GPT2-xl, Qwen3-1.7B (https://huggingface.co/Qwen/Qwen3-1.7B), Qwen3-8B (https://huggingface.co/Qwen/Qwen3-8B), Gemma2-2B (https://huggingface.co/google/gemma-2b), LLaMA3.2-3B (https://huggingface.co/meta-llama/Llama-3.2-3B), and LLaMA3.1-8B (https://huggingface.co/meta-llama/Llama-3.1-8B).
## Appendix C Proxy Sensitivity
Table 5: Performance comparison under different target proxy settings.

We evaluate the proxy sensitivity of DOPA by replacing the target proxy with an LLM that differs from the one used as the source proxy. Specifically, Table [5](https://arxiv.org/html/2606.00014#A3.T5) shows the results of using Llama3.1-8B and Qwen3-1.7B as replacements on the SA task. Although proxy mismatches may affect performance, DOPA still consistently outperforms KNN, which further supports the robustness of the proxy-based design. We also find that a replacement from a different model series (Qwen3-1.7B) performs worse than a replacement from the same series (Llama3.1-8B). This indicates that the source and target proxies should be as similar as possible so that the characteristics of different samples are quantified accurately.
## Appendix D Experiments on More Tasks
Table 6: Experimental results on QA tasks.

We further briefly evaluate the effectiveness of DOPA on the extractive Question Answering (QA) tasks in the BOSS benchmark. For this evaluation, 10,000 samples are drawn from each dataset, regardless of whether it belongs to the source domain or the target domain. We adopt Exact Match (EM) and F1 as evaluation metrics. EM measures whether the predicted answer exactly matches the ground-truth span, reflecting strict answer accuracy. F1 computes the token-level overlap between the predicted and gold answers, providing a softer measure of partial matching quality and semantic coverage.
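The two metrics can be sketched as below. The answer normalization (lowercasing, stripping punctuation and English articles) follows the common SQuAD-style convention and is an assumption here, since the exact evaluation script is not specified in this appendix.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse spaces.

    SQuAD-style normalization; assumed here, not confirmed by the paper.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em_strict = exact_match("The Eiffel Tower.", "eiffel tower")  # 1.0
em_partial = exact_match("in Paris, France", "Paris")         # 0.0
f1_partial = token_f1("in Paris, France", "Paris")            # 0.5
```

The last two calls show why both metrics are reported: a prediction that contains the gold span plus extra tokens scores zero on EM but still receives partial credit under F1.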
As shown in Table[6](https://arxiv.org/html/2606.00014#A4.T6), DOPA consistently outperforms the KNN baseline across all datasets and model settings, demonstrating its effectiveness on extractive QA tasks\. Specifically, for LLaMA3\.2\-3B, DOPA improves the average performance from 43\.60 to 44\.73, yielding a gain of 1\.13 points\. For Qwen3\-1\.7B, DOPA boosts the average score from 45\.50 to 46\.35, achieving an improvement of 0\.85 points\. A closer examination of individual datasets further shows that DOPA delivers consistent improvements on advQA, searchQA, and newsQA, indicating strong robustness across diverse domains and question distributions\. These results validate that the effectiveness of DOPA extends beyond classification and NER tasks and generalizes well to more challenging extractive QA scenarios\.
## Appendix E Case Study
Figure 6: Case study on sst and implicit.

The examples in Figure [6](https://arxiv.org/html/2606.00014#A5.F6) illustrate that both DOPA and DOPA-mah select samples whose stylistic expressions are closely aligned with the test inputs, capturing similar tone, sentence structure, and emotional or toxicity intensity. However, DOPA demonstrates slightly better diversity: in sst, while both methods retrieve strongly negative, concise opinions, DOPA's samples vary slightly more in content and phrasing. In the implicit task, both methods capture politically charged and provocative language, but DOPA avoids redundancy by selecting stylistically consistent yet semantically distinct sentences. In contrast, KNN selects samples that, although semantically related, deviate significantly in style, favoring longer or expository sentences that do not match the terse nature of the test examples. Overall, DOPA achieves stronger style alignment with greater diversity, while KNN struggles to capture the nuanced stylistic cues of the target domain.
Figure 7: Euclidean distance comparison to target domain samples for retrieval results with and without the diversity constraint (with MahDist and w/o MahDist) on all tasks. Panels: (a) dynasent, (b) semeval, (c) adv\_civil, (d) implicit\_hate, (e) toxigen, (f) anli, (g) contract\_nli, (h) wanli.
## Appendix F More Visualization Results
We further present KDE distributions of sample representations across various tasks in Figure [5](https://arxiv.org/html/2606.00014#A1.F5) to demonstrate the generality of DOPA in selecting appropriate samples. Overall, the samples selected by the proxy consistently exhibit a distribution that shifts away from the source domain and moves closer to the target domain. For example, on the implicit\_hate dataset, the proxy-based distribution almost completely overlaps with that of the target domain. This demonstrates DOPA's ability to identify samples whose underlying distribution is similar to that of the target domain. However, we also observe that in a few cases (e.g., anli), the proxy-based distribution fails to move away from the source domain. This may be attributed to the nature of anli itself, which is a human-crafted adversarial benchmark, making it challenging for the model to accurately capture its characteristics. We do not perform the corresponding visualization experiments on the NER dataset because it is not a sentence-level task, which makes it difficult to obtain the relevant probability distributions. Overall, DOPA's ability to approximate the target distribution generalizes well across most tasks.
Figure [7](https://arxiv.org/html/2606.00014#A5.F7) illustrates the effect of the diversity constraint across additional datasets. Similar to Figure [4(c)](https://arxiv.org/html/2606.00014#S4.F4.sf3), the curve fitted under the with MahDist setting demonstrates greater diversity. Together with the ablation study, this provides strong evidence for the effectiveness of the diversity constraint component in DOPA.