ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection

arXiv cs.AI Papers

Summary

Proposes ARIADNE, a training-free, adapter-agnostic routing framework that selects the optimal PEFT adapter at inference time by measuring input proximity to adapter-specific centroids in embedding space, recovering 97.44% of upper-bound performance on 23 tasks.

arXiv:2606.19079v1 Announce Type: new

Cached at: 06/18/26, 05:41 AM

# ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection
Source: [https://arxiv.org/html/2606.19079](https://arxiv.org/html/2606.19079)
Enrico Cassano¹,², Michał Brzozowski², Zuzanna Dubanowska², Paolo Mandica², Neo Christopher Chung²

¹University of Turin, ²Samsung AI Center, Warsaw, Poland

Correspondence: enrico.cassano@unito.it

###### Abstract

The increasing deployment of parameter-efficient fine-tuning (PEFT) has led to model ecosystems in which a single backbone is paired with many task-specialized adapters. In this setting, inference-time queries often arrive without task labels, requiring the system to automatically select the most appropriate adapter from a growing and heterogeneous adapter pool. Existing routing methods either depend on access to adapter internals, such as weight decompositions or gradient-based statistics, or require additional router training, which limits scalability and portability as new adapters are added. We introduce ARIADNE, a training-free, adapter-agnostic routing framework for dynamic adapter selection at inference time. ARIADNE represents each adapter through a set of centroids computed from embeddings of its training set, capturing the data distribution associated with that adapter. Given an unlabeled input, it selects an adapter by measuring proximity to these centroids in latent space. Because routing is performed entirely in the input embedding space, ARIADNE is compatible with arbitrary PEFT methods and requires no modification to the adapters or training procedures. Primarily evaluated with Llama 3.2 1B Instruct on 23 diverse NLP tasks, ARIADNE recovers 97.44% of the upper bound performance. Scaling to 44 tasks, it achieves 89.7% average selection accuracy, without additional training or access to adapter internals.


## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.19079v1/x1.png)

Figure 1: Adapter Selection Accuracy (SA) comparison between ARIADNE and the spectral routing methods Arrow and SpectR. ARIADNE consistently outperforms both across all tasks.

The proliferation of parameter-efficient fine-tuning (PEFT) methods has fundamentally altered the landscape of language model adaptation. Rather than fine-tuning monolithic models end-to-end, practitioners now maintain growing libraries of lightweight adapters (Hu et al., [2022](https://arxiv.org/html/2606.19079#bib.bib12); Houlsby et al., [2019](https://arxiv.org/html/2606.19079#bib.bib13)), each specializing a shared backbone for a particular task or domain. This modular paradigm offers compelling advantages in terms of storage, compute, and composability. Yet it introduces a critical challenge: given an input without a task label and a library of $n$ specialized adapters, how does one select the most appropriate one without the overhead of additional training, labeled data, or privileged access to adapter internals?

Existing approaches to this routing problem can be broadly divided into two families. The first employs retrieval-based mechanisms trained on labeled task data. LoraRetriever (Zhao et al., [2024](https://arxiv.org/html/2606.19079#bib.bib8)) fine-tunes a sentence embedding model via contrastive learning to align inputs with adapter representations, achieving strong performance but requiring an additional supervised training phase and access to each adapter's training distribution. The second family uses spectral routing, deriving routing signals directly from adapter weights. Representative examples include Arrow (Ostapenko et al., [2024](https://arxiv.org/html/2606.19079#bib.bib11)) and SpectR (Fleshman and Van Durme, [2025](https://arxiv.org/html/2606.19079#bib.bib32)), which build prototypes from the SVD of each LoRA weight matrix and assign inputs based on the alignment between these prototypes and the model's hidden states. Although these approaches operate in a zero-shot setting, they are strictly designed around the LoRA formulation and do not naturally generalize to other PEFT methods (Kopiczko et al., [2023](https://arxiv.org/html/2606.19079#bib.bib16); Liu et al., [2024](https://arxiv.org/html/2606.19079#bib.bib17)). Moreover, empirical results show that these methods struggle when adapters are similar, with Arrow performing close to random chance on several benchmarks (Fleshman and Van Durme, [2025](https://arxiv.org/html/2606.19079#bib.bib32)).

We propose ARIADNE (Agnostic Routing for Inference-time Adapter DyNamic sElection), a zero-shot routing framework for dynamic adapter selection compatible with any PEFT architecture. The key insight is to frame adapter routing as an input classification problem: the latent geometry of a frozen, off-the-shelf text encoder is sufficient to distinguish task distributions, without relying on adapter weights or gradients. Inputs from the same task naturally cluster in this space, making it a reliable signal for routing. For each task, we construct a set of $m$ representative centroids by clustering embeddings of samples drawn from its training set. At inference time, an input is projected into the same space, and the adapter whose centroid set yields the highest cosine similarity is selected. We instantiate ARIADNE on top of Llama 3.2 1B Instruct and Qwen2.5 3B Instruct, and evaluate it end-to-end on 23 diverse NLP tasks, measuring both adapter Selection Accuracy (SA) and Task Performance (TP) against an oracle upper bound. ARIADNE achieves an average TP of 54.74%, recovering 97.44% of the average oracle performance (56.18%). On the 5-task subset shared with Arrow and SpectR, ARIADNE consistently outperforms both baselines in adapter SA. An extended scalability study on 44 tasks confirms that routing performance remains stable as the adapter library grows, reaching an average SA of 89.7%.

Our main contributions are:

- We reframe adapter routing as an input classification problem and propose a zero-shot centroid-based routing mechanism that operates entirely in the input embedding space.
- Thanks to its reliance solely on the input space rather than adapter weight decompositions, ARIADNE is, by construction, compatible with any PEFT architecture.
- We show empirically that input geometry alone provides a strong signal for effective adapter selection, outperforming spectral routing methods while scaling robustly.
- We conduct a systematic analysis of routing failure modes and show that errors are concentrated within semantically related task clusters, leading to graceful rather than catastrophic degradation.

## 2 Related Work

Parameter-Efficient Fine-Tuning. The growth of large pretrained language models has made full fine-tuning increasingly impractical. Parameter-efficient fine-tuning (PEFT) methods address this by updating only a small subset of parameters while keeping the backbone frozen (Houlsby et al., [2019](https://arxiv.org/html/2606.19079#bib.bib13); Hu et al., [2022](https://arxiv.org/html/2606.19079#bib.bib12); Li and Liang, [2021](https://arxiv.org/html/2606.19079#bib.bib14); Liu et al., [2022](https://arxiv.org/html/2606.19079#bib.bib15)). Among these, Low-Rank Adaptation (LoRA) (Hu et al., [2022](https://arxiv.org/html/2606.19079#bib.bib12)) has emerged as the dominant paradigm: for each weight matrix $W \in \mathbb{R}^{d \times k}$, it introduces a residual update $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$, reducing the number of trainable parameters by orders of magnitude while preserving competitive downstream performance. Subsequent variants including VeRA (Kopiczko et al., [2023](https://arxiv.org/html/2606.19079#bib.bib16)), DoRA (Liu et al., [2024](https://arxiv.org/html/2606.19079#bib.bib17)), AdaLoRA (Zhang et al., [2023](https://arxiv.org/html/2606.19079#bib.bib18)), and GPart (Mandica et al., [2026](https://arxiv.org/html/2606.19079#bib.bib35)) further improve parameter efficiency and flexibility. Our work does not modify or extend any specific PEFT method; instead, it provides routing that operates independently of the adapter implementation.
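To make the parameter saving of the LoRA update concrete, the following minimal sketch builds $\Delta W = BA$ and counts trainable parameters. The dimensions `d`, `k`, and `r` are illustrative assumptions, not values from the paper, and the zero initialization of `B` follows the common LoRA convention rather than anything stated here.

```python
import numpy as np

# Illustrative dimensions only (assumptions, not taken from the paper).
d, k, r = 1024, 1024, 8

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))          # frozen backbone weight, W in R^{d x k}
B = np.zeros((d, r))                     # LoRA factor B in R^{d x r}, zero-initialized
A = rng.standard_normal((r, k)) * 0.01   # LoRA factor A in R^{r x k}

delta_W = B @ A                          # low-rank residual update, same shape as W
W_adapted = W + delta_W                  # effective weight once the adapter is applied

full_params = d * k                      # parameters updated by full fine-tuning
lora_params = r * (d + k)                # trainable parameters held in B and A
reduction = full_params / lora_params
print(full_params, lora_params, reduction)
```

With these assumed sizes, a single adapted matrix trains `r * (d + k)` parameters instead of `d * k`, which is where the "orders of magnitude" reduction comes from once `r` is much smaller than `min(d, k)`.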

Adapter Selection. Existing approaches vary considerably in the resources and assumptions they require, ranging from methods that train dedicated routing components to fully zero-shot, data-free alternatives. LoraRetriever (Zhao et al., [2024](https://arxiv.org/html/2606.19079#bib.bib8)) frames adapter selection as a retrieval and composition problem, but requires training a dedicated retrieval component on top of each adapter's training data, a costly overhead that ARIADNE avoids entirely. More closely related to our approach are methods that perform zero-shot routing by exploiting the internal structure of adapter weights. Arrow (Ostapenko et al., [2024](https://arxiv.org/html/2606.19079#bib.bib11)) uses the first right singular vector of each adapter's weight product matrix, obtained via SVD, as a proxy for its training distribution, while SpectR (Fleshman and Van Durme, [2025](https://arxiv.org/html/2606.19079#bib.bib32)) extends this idea by leveraging the full covariance spectrum. Both methods require white-box access to adapter internals, and their reliance on the SVD of LoRA weight matrices ties them architecturally to the LoRA family and, implicitly, to the underlying base model. Furthermore, Fleshman and Van Durme ([2025](https://arxiv.org/html/2606.19079#bib.bib32)) show that these spectral proxies can be unreliable: Arrow degrades to near-random routing accuracy on highly similar task pairs, while SpectR falls even below the random threshold in the same setting.
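As a rough illustration of the weight-space idea described above, the sketch below takes the first right singular vector of a LoRA product $BA$ as a prototype and routes a hidden state to the best-aligned prototype. It is an illustrative reading of that description on toy rank-1 adapters, not the Arrow or SpectR implementation; the function names, the toy adapters, and the use of the absolute dot product (chosen because the sign of a singular vector is arbitrary) are assumptions.

```python
import numpy as np

def spectral_prototype(B, A):
    """First right singular vector of the LoRA product BA (an input-space direction)."""
    _, _, Vh = np.linalg.svd(B @ A, full_matrices=False)
    return Vh[0]

def route_by_alignment(hidden_state, prototypes):
    """Pick the adapter whose prototype is most aligned with the hidden state.

    The absolute value is an assumption: singular vectors are sign-ambiguous.
    """
    scores = {name: abs(float(hidden_state @ p)) for name, p in prototypes.items()}
    return max(scores, key=scores.get)

# Toy rank-1 adapters (hypothetical): each one reads a single input coordinate.
d, k = 6, 4
rng = np.random.default_rng(0)
adapters = {}
for idx, name in enumerate(["adapter_a", "adapter_b"]):
    B = rng.standard_normal((d, 1))
    A = np.zeros((1, k))
    A[0, idx] = 1.0
    adapters[name] = (B, A)

prototypes = {name: spectral_prototype(B, A) for name, (B, A) in adapters.items()}
h = np.array([0.1, 2.0, 0.0, 0.0])   # hidden state dominated by coordinate 1
selected = route_by_alignment(h, prototypes)
print(selected)
```

Note that this style of routing needs the factors `B` and `A` themselves, which is the white-box requirement the paragraph above points out.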

Like Arrow and SpectR, ARIADNE requires no training of additional components. Unlike these methods, however, it grounds routing entirely in the latent geometry of a frozen, off-the-shelf encoder, decoupling the routing mechanism from both adapter internals and the underlying PEFT architecture. As a result, ARIADNE is agnostic to both adapter type and base model, enabling the same routing method to transfer across model families and scales. This design is further motivated by recent evidence that internal model representations are unreliable as general-purpose routing signals, with spectral approaches often degrading out-of-distribution (Dubanowska et al., [2025](https://arxiv.org/html/2606.19079#bib.bib31)).

## 3 Methodology

Routing Without Adapter Access. A central design choice in ARIADNE is to ground routing decisions exclusively in the latent geometry of a text encoder, rather than in adapter internals. This decoupling is both principled and practical: adapter weights encode the output of a training process whose data distribution and optimization trajectory are opaque at deployment time, and weight-space signals offer no guarantee of correspondence with task boundaries in input space (Fleshman and Van Durme, [2025](https://arxiv.org/html/2606.19079#bib.bib32); Dubanowska et al., [2025](https://arxiv.org/html/2606.19079#bib.bib31)). By operating only on inputs, ARIADNE offers three properties that spectral routing methods cannot provide: compatibility with any PEFT architecture by design, straightforward extension to new tasks by simply computing centroids from training samples, and independence from the underlying backbone, so that the same routing infrastructure transfers across model families and scales.

Problem Formulation. Let $\mathcal{T} = \{T_1, \ldots, T_n\}$ denote a set of $n$ tasks. Each task $T_i$ is associated with a dataset $\mathcal{D}_i = \{(x_{i,k}, y_{i,k})\}_{k=1}^{N_i}$, where $N_i$ is the number of examples in task $T_i$, $x_{i,k}$ is the $k$-th input, and $y_{i,k}$ is its corresponding label. The full multi-task dataset is then defined as $\mathcal{D} = \bigcup_{i=1}^{n} \mathcal{D}_i$. We consider a base language model $L$ and a library of $n$ task-specific adapters $\Phi = \{\phi_1, \ldots, \phi_n\}$, where each $\phi_i$ is optimized for $T_i$. Under a mixed-task scenario, an input $x$ is submitted to $L$ without a task label, and the objective is to select the adapter best suited to process it.

Adapter Library. We train LoRA adapters on established benchmark tasks that span four semantic categories. These adapters are fundamental to our evaluation: they enable end-to-end measurement of routing quality under realistic conditions, and allow us to characterize what happens when routing fails. Full details on tasks and training are provided in Appendix [A.3](https://arxiv.org/html/2606.19079#A1.SS3).

Dynamic Selection. For each task $T_i$, we represent its input distribution with a set of $m$ task-representative centroids $\mathcal{C}_i = \{c_{i,j}\}_{j=1}^{m}$, computed in the embedding space of a frozen auxiliary encoder $e(\cdot)$. To construct these centroids, we sample $m$ subsets $S_{i,j} \subset \mathcal{D}_i$ with different strategies (Appendix [A.10](https://arxiv.org/html/2606.19079#A1.SS10)), and average the embeddings of their inputs. Formally, for each $j \in \{1, \ldots, m\}$, the $j$-th centroid for task $T_i$ is defined as

$$c_{i,j} = \frac{1}{|S_{i,j}|} \sum_{(x,y) \in S_{i,j}} e(x). \tag{1}$$

This multi-centroid representation captures intra-task variability more effectively than a single global prototype. At inference time, an unlabeled input $x$ is embedded in the same space, and the routing function selects the adapter associated with the most similar task centroid:

$$i^{*} = \arg\max_{i} \left( \max_{c \in \mathcal{C}_i} \cos(e(x), c) \right). \tag{2}$$

The selected adapter is then $\phi_{i^{*}}$.
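The two equations above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the embeddings are synthetic stand-ins for the output of the frozen encoder $e(\cdot)$, the subsets $S_{i,j}$ are drawn uniformly at random (the paper refers to several strategies in its appendix), and the names `build_centroids` and `route` are hypothetical.

```python
import numpy as np

def build_centroids(embeddings, m=5, samples_per_centroid=500, seed=0):
    """Eq. (1): average m sampled subsets of one task's embeddings into m centroids.

    Uniform random subsets are an assumption made for this sketch.
    """
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    size = min(samples_per_centroid, n)
    centroids = []
    for _ in range(m):
        idx = rng.choice(n, size=size, replace=False)
        centroids.append(embeddings[idx].mean(axis=0))
    return np.stack(centroids)

def _unit(v):
    """Scale vectors to unit length so that dot products equal cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def route(query_embedding, task_centroids):
    """Eq. (2): pick the task whose best centroid has the highest cosine similarity."""
    q = _unit(query_embedding)
    scores = {task: float(np.max(_unit(c) @ q)) for task, c in task_centroids.items()}
    return max(scores, key=scores.get), scores

# Synthetic stand-in for encoder outputs: two well-separated toy "tasks".
rng = np.random.default_rng(0)
dim = 8
task_means = {"nli": np.eye(dim)[0] * 5.0, "qa": np.eye(dim)[1] * 5.0}
task_embeddings = {t: mu + 0.1 * rng.standard_normal((200, dim))
                   for t, mu in task_means.items()}
task_centroids = {t: build_centroids(e, m=5, samples_per_centroid=50)
                  for t, e in task_embeddings.items()}

query = task_means["qa"] + 0.1 * rng.standard_normal(dim)
chosen, scores = route(query, task_centroids)
print(chosen)
```

Adding a new adapter in this sketch only means computing one more entry of `task_centroids`; nothing about the existing entries or the adapters themselves has to change.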

## 4 Experiments

| Category | SA | Base Model (TP) | Oracle (TP) | ARIADNE (TP) | Recoup % |
|---|---|---|---|---|---|
| NLI | 81% | 23.37% | 60.16% | 58.06% | 96.51% |
| QA | 83% | 22.91% | 51.55% | 49.14% | 95.32% |
| Similarity | 96% | 19.42% | 66.60% | 65.20% | 97.92% |
| Reasoning | 100% | 23.38% | 46.40% | 46.40% | 100.0% |
| Avg. | 85% | 22.27% | 56.18% | 54.74% | 97.44% |

Table 1: End-to-end performance across 23 tasks grouped by semantic category. SA: adapter Selection Accuracy; TP: Task Performance; Recoup: TP recovered relative to Oracle.

Setup. We evaluate ARIADNE on top of Llama 3.2 1B Instruct (Grattafiori et al., [2024](https://arxiv.org/html/2606.19079#bib.bib29)) and Qwen2.5 3B Instruct (Team, [2024](https://arxiv.org/html/2606.19079#bib.bib36)) (results in Appendix [A.2](https://arxiv.org/html/2606.19079#A1.SS2)). We use a library of 23 LoRA adapters, each trained independently on a single task from one of four semantic categories: NLI, QA, Similarity, and Reasoning, covering established benchmarks. The frozen embedder $e(\cdot)$ is intfloat/e5-large-v2 (Wang et al., [2022](https://arxiv.org/html/2606.19079#bib.bib30)). We report the motivation for this choice in Appendix [A.5](https://arxiv.org/html/2606.19079#A1.SS5). Our main results are computed with up to 500 samples per centroid and $m = 5$ centroids per task. A robustness study on the number of training samples is in Appendix [A.9](https://arxiv.org/html/2606.19079#A1.SS9), and the selection of the value of $m$ is in Appendix [A.6](https://arxiv.org/html/2606.19079#A1.SS6).

Comparison with Spectral Routing. We compare adapter SA against Arrow (Ostapenko et al., [2024](https://arxiv.org/html/2606.19079#bib.bib11)) and SpectR (Fleshman and Van Durme, [2025](https://arxiv.org/html/2606.19079#bib.bib32)) on the 5-task intersection shared across evaluations: HellaSwag (Zellers et al., [2019](https://arxiv.org/html/2606.19079#bib.bib37)), MNLI (Williams et al., [2018](https://arxiv.org/html/2606.19079#bib.bib38)), MRPC (Wang et al., [2018](https://arxiv.org/html/2606.19079#bib.bib34)), QQP (Wang et al., [2018](https://arxiv.org/html/2606.19079#bib.bib34)), and SST-2 (Socher et al., [2013](https://arxiv.org/html/2606.19079#bib.bib39)).

Adapter Selection and End-to-End Performance. For each of the 23 selected tasks, we assess the SA on 50 random test samples. We then evaluate the pipeline end-to-end by running the adapter selected by ARIADNE and measuring the TP against an Oracle, defined as a configuration that always uses the correct adapter. The gap between the TP of ARIADNE and that of the Oracle directly quantifies the performance cost attributable to routing errors. Finally, to assess scalability, we report SA across an extended set of 44 tasks.

## 5 Results

Comparison with Spectral Routing. Figure [1](https://arxiv.org/html/2606.19079#S1.F1) compares ARIADNE with Arrow and SpectR on the 5-task benchmark shared across all three methods. The approaches differ in their underlying assumptions: Arrow and SpectR do not need training data but require white-box access to LoRA weight matrices, whereas ARIADNE uses training samples without requiring access to adapter internals. ARIADNE achieves the best performance on every task, with the largest gains on MRPC and QQP, where spectral methods degrade to near-random routing accuracy, consistent with their known failure mode (Fleshman and Van Durme, [2025](https://arxiv.org/html/2606.19079#bib.bib32)).

Adapter Selection and Performance. Table [1](https://arxiv.org/html/2606.19079#S4.T1.fig1) summarizes results on 23 tasks grouped by semantic category. ARIADNE achieves a zero-shot average SA of 85%, translating to an average TP of 54.74% and recovering 97.44% of Oracle performance. Routing is most reliable in the Similarity and Reasoning categories, where near-perfect selection closes the gap to the Oracle. NLI is the most challenging category, with 81% SA, largely due to the high similarity among its tasks. Even in this setting, however, ARIADNE recovers 96.51% of Oracle performance, indicating that routing errors tend to select semantically related adapters rather than causing catastrophic degradation. Full per-task results are reported in Appendix [A.1](https://arxiv.org/html/2606.19079#A1.SS1). Appendix [A.2](https://arxiv.org/html/2606.19079#A1.SS2) reports end-to-end performance with an additional backbone, Qwen2.5 3B Instruct. We report the overhead of ARIADNE in Appendix [A.7](https://arxiv.org/html/2606.19079#A1.SS7).
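As a quick check of how the recovery figures relate to the TP values, the sketch below assumes Recoup is simply the ratio of ARIADNE's TP to the Oracle's TP, expressed as a percentage (the Table 1 caption describes it as "TP recovered relative to Oracle"). The function name `recoup` is hypothetical, and the inputs are the rounded Average and NLI values from Table 1.

```python
def recoup(task_performance, oracle_performance):
    """Percentage of Oracle task performance retained after routing (assumed definition)."""
    return 100.0 * task_performance / oracle_performance

avg_recoup = recoup(54.74, 56.18)   # average TP vs. average Oracle TP
nli_recoup = recoup(58.06, 60.16)   # NLI category TP vs. NLI Oracle TP
print(f"{avg_recoup:.2f} {nli_recoup:.2f}")   # prints "97.44 96.51"
```

Because the table entries are themselves rounded, recomputing other rows this way may differ from the reported Recoup in the last digit.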

Scalability. SA degrades gracefully as the adapter library grows, stabilizing at an average SA of 89.7% across 44 tasks. As shown in Figure [2](https://arxiv.org/html/2606.19079#S5.F2.fig1), the initial drop reflects the increasing difficulty of distinguishing semantically proximate tasks as the candidate pool expands. Crucially, performance plateaus beyond approximately 20 adapters rather than continuing to degrade. This trend is consistent with our failure mode analysis, which shows that routing errors are concentrated within semantic clusters. The SA for each task is reported in Appendix [A.8](https://arxiv.org/html/2606.19079#A1.SS8).

![Refer to caption](https://arxiv.org/html/2606.19079v1/x2.png)

Figure 2: SA trend for up to 44 tasks.

Graceful Degradation. The most severe routing failure occurs on SQuAD V1 (Rajpurkar et al., [2016](https://arxiv.org/html/2606.19079#bib.bib42)), which achieves 0% SA because it is consistently routed to the SQuAD V2 (Rajpurkar et al., [2018](https://arxiv.org/html/2606.19079#bib.bib43)) adapter. ARIADNE still recovers 85% of Oracle performance (64% vs 75% TP), since the two tasks are semantically close and their adapters transfer well. This behavior highlights a major advantage of ARIADNE: since routing is performed in a semantically structured space, errors tend to result in graceful degradation, making routing errors easier to diagnose (Appendix [A.4](https://arxiv.org/html/2606.19079#A1.SS4)).

## 6 Conclusions

We introduced ARIADNE, a zero-shot framework for dynamic adapter selection that reframes routing as an input classification problem. By operating in the latent space of a frozen text encoder rather than relying on adapter weight decompositions, ARIADNE is, by construction, compatible with arbitrary PEFT architectures. Evaluated across 23 tasks, it recovers 97.44% of Oracle performance and outperforms spectral routing methods on shared tasks, suggesting that input geometry provides an effective signal for adapter selection.

## Limitations

The primary limitation of ARIADNE is that it requires access to training data to compute task centroids. As a result, it is not directly applicable to decentralized adapter ecosystems where training data is proprietary or unavailable. A promising direction for future work is to combine ARIADNE with knowledge extraction in order to derive task-descriptive signals directly from the adapters, enabling the construction of input-space fingerprints without requiring access to training samples. This would make our approach fully data-free while preserving its adapter-agnostic nature. At the same time, in many practical PEFT settings, adapters are released together with at least some information about their training data or task domain.

## References

- M. Bañón, P. Chen, B. Haddow, K. Heafield, H. Hoang, M. Esplà-Gomis, M. L. Forcada, A. Kamran, F. Kirefu, P. Koehn, S. Ortiz Rojas, L. Pla Sempere, G. Ramírez-Sánchez, E. Sarrías, M. Strelec, B. Thompson, W. Waites, D. Wiggins, and J. Zaragoza (2020). ParaCrawl: web-scale acquisition of parallel corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault (Eds.), Online, pp. 4555–4567. [Link](https://aclanthology.org/2020.acl-main.417/), [Document](https://dx.doi.org/10.18653/v1/2020.acl-main.417)
- PIQA: reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence.
- O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, M. Huck, A. Jimeno Yepes, P. Koehn, V. Logacheva, C. Monz, M. Negri, A. Neveol, M. Neves, M. Popel, M. Post, R. Rubino, C. Scarton, L. Specia, M. Turchi, K. Verspoor, and M. Zampieri (2016). Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation, Berlin, Germany, pp. 131–198. [Link](http://www.aclweb.org/anthology/W/W16/W16-2301)
- O. Bojar, C. Buck, C. Federmann, B. Haddow, P. Koehn, J. Leveling, C. Monz, P. Pecina, M. Post, H. Saint-Amand, R. Soricut, L. Specia, and A. Tamchyna (2014). Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, Maryland, USA, pp. 12–58. [Link](http://www.aclweb.org/anthology/W/W14/W14-3302)
- S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015). A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, L. Màrquez, C. Callison-Burch, and J. Su (Eds.), Lisbon, Portugal, pp. 632–642. [Link](https://aclanthology.org/D15-1075), [Document](https://dx.doi.org/10.18653/v1/D15-1075)
- D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia (2017). SemEval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 1–14.
- C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova (2019). BoolQ: exploring the surprising difficulty of natural yes/no questions. In NAACL.
- P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018). Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv:1803.05457v1.
- Y. Dong, Y. Lin, and X. Yang (2025). CoPA: hierarchical concept prompting and aggregating network for explainable diagnosis. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 67–76.
- D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner (2019). DROP: a reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proc. of NAACL.
- Z. Dubanowska, M. Żelaszczyk, M. Brzozowski, P. Mandica, and M. P. Karpowicz (2025). Representation-based broad hallucination detectors fail to generalize out of distribution. In Findings of the Association for Computational Linguistics: EMNLP 2025, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng (Eds.), Suzhou, China, pp. 17563–17575. [Link](https://aclanthology.org/2025.findings-emnlp.952/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.952), ISBN 979-8-89176-335-7
- O. Dušek, J. Novikova, and V. Rieser (2020). Evaluating the State-of-the-Art of End-to-End Natural Language Generation: The E2E NLG Challenge. Computer Speech & Language 59, pp. 123–156. [Document](https://dx.doi.org/10.1016/j.csl.2019.06.009), arXiv:1901.11528
- W. Fleshman and B. Van Durme (2025). SpectR: dynamically composing LM experts with spectral routing. arXiv preprint arXiv:2504.03454.
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024). The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly (2019). Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pp. 2790–2799.
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2022). LoRA: low-rank adaptation of large language models. In International Conference on Learning Representations. [Link](https://openreview.net/forum?id=nZeVKeeFYf9)
- M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer (2017). TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. arXiv e-prints, arXiv:1705.03551.
- D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, and D. Roth (2018). Looking beyond the surface: a challenge set for reading comprehension over multiple sentences. In Proceedings of North American Chapter of the Association for Computational Linguistics (NAACL).
- D. J. Kopiczko, T. Blankevoort, and Y. M. Asano (2023). VeRA: vector-based random matrix adaptation. arXiv preprint arXiv:2310.11454.
- T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, M. Kelcey, J. Devlin, K. Lee, K. N. Toutanova, L. Jones, M. Chang, A. Dai, J. Uszkoreit, Q. Le, and S. Petrov (2019). Natural questions: a benchmark for question answering research. Transactions of the Association of Computational Linguistics.
- H. Levesque, E. Davis, and L. Morgenstern (2012). The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.
- X. L. Li and P. Liang (2021). Prefix-tuning: optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4582–4597.
- B. Y. Lin, W. Zhou, M. Shen, P. Zhou, C. Bhagavatula, Y. Choi, and X. Ren (2020). CommonGen: a constrained text generation challenge for generative commonsense reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 1823–1840. [Link](https://www.aclweb.org/anthology/2020.findings-emnlp.165), [Document](https://dx.doi.org/10.18653/v1/2020.findings-emnlp.165)
- H. Liu, D. Tam, M. Muqeeth, J. Mohta, T. Huang, M. Bansal, and C. A. Raffel (2022). Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems 35, pp. 1950–1965.
- S. Liu, C. Wang, H. Yin, P. Molchanov, Y. F. Wang, K. Cheng, and M. Chen (2024). DoRA: weight-decomposed low-rank adaptation. In Forty-first International Conference on Machine Learning.
- A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts (2011). Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, pp. 142–150. [Link](http://www.aclweb.org/anthology/P11-1015)
- P. Mandica, M. Brzozowski, Z. Dubanowska, and N. C. Chung (2026). GPart: end-to-end isometric fine-tuning via global parameter partitioning. arXiv:2605.14841. [Link](https://arxiv.org/abs/2605.14841)
- L. Nan, D. Radev, R. Zhang, A. Rau, A. Sivaprasad, C. Hsieh, X. Tang, A. Vyas, N. Verma, P. Krishna, Y. Liu, N. Irwanto, J. Pan, F. Rahman, A. Zaidi, M. Mutuma, Y. Tarabar, A. Gupta, T. Yu, Y. C. Tan, X. V. Lin, C. Xiong, R. Socher, and N. F. Rajani (2021). DART: open-domain structured data record to text generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 432–447. [Link](https://aclanthology.org/2021.naacl-main.37), [Document](https://dx.doi.org/10.18653/v1/2021.naacl-main.37)
- Y\. Nie, A\. Williams, E\. Dinan, M\. Bansal, J\. Weston, and D\. Kiela \(2020\)Adversarial nli: a new benchmark for natural language understanding\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,Cited by:[Table 6](https://arxiv.org/html/2606.19079#A1.T6.1.15.13.1),[Table 6](https://arxiv.org/html/2606.19079#A1.T6.1.16.14.1),[Table 6](https://arxiv.org/html/2606.19079#A1.T6.1.17.15.1)\.
- O\. Ostapenko, Z\. Su, E\. M\. Ponti, L\. Charlin, N\. L\. Roux, M\. Pereira, L\. Caccia, and A\. Sordoni \(2024\)Towards modular llms by building and reusing a library of loras\.arXiv preprint arXiv:2405\.11157\.Cited by:[§1](https://arxiv.org/html/2606.19079#S1.p2.1),[§2](https://arxiv.org/html/2606.19079#S2.p2.1),[§4](https://arxiv.org/html/2606.19079#S4.p2.1)\.
- A\. Rahman and V\. Ng \(2012\)Resolving complex cases of definite pronouns: the winograd schema challenge\.InProceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning,pp\. 777–789\.Cited by:[Table 6](https://arxiv.org/html/2606.19079#A1.T6.1.4.2.1)\.
- P\. Rajpurkar, R\. Jia, and P\. Liang \(2018\)Know what you don’t know: unanswerable questions for SQuAD\.InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\),I\. Gurevych and Y\. Miyao \(Eds\.\),Melbourne, Australia,pp\. 784–789\.External Links:[Link](https://aclanthology.org/P18-2124/),[Document](https://dx.doi.org/10.18653/v1/P18-2124)Cited by:[Table 6](https://arxiv.org/html/2606.19079#A1.T6.1.24.22.1),[§5](https://arxiv.org/html/2606.19079#S5.p4.1)\.
- P\. Rajpurkar, J\. Zhang, K\. Lopyrev, and P\. Liang \(2016\)SQuAD: 100,000\+ questions for machine comprehension of text\.InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing,J\. Su, K\. Duh, and X\. Carreras \(Eds\.\),Austin, Texas,pp\. 2383–2392\.External Links:[Link](https://aclanthology.org/D16-1264/),[Document](https://dx.doi.org/10.18653/v1/D16-1264)Cited by:[Table 6](https://arxiv.org/html/2606.19079#A1.T6.1.23.21.1),[§5](https://arxiv.org/html/2606.19079#S5.p4.1)\.
- R\. Sharma, J\. Allen, O\. Bakhshandeh, and N\. Mostafazadeh \(2018\)Tackling the story ending biases in the story cloze test\.InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\),Melbourne, Australia,pp\. 752–757\.External Links:[Link](https://www.aclweb.org/anthology/P18-2119),[Document](https://dx.doi.org/10.18653/v1/P18-2119)Cited by:[Table 6](https://arxiv.org/html/2606.19079#A1.T6.1.31.29.1)\.
- R\. Socher, A\. Perelygin, J\. Wu, J\. Chuang, C\. D\. Manning, A\. Ng, and C\. Potts \(2013\)Recursive deep models for semantic compositionality over a sentiment treebank\.InProceedings of the 2013 Conference on Empirical Methods in Natural Language Processing,Seattle, Washington, USA,pp\. 1631–1642\.External Links:[Link](https://www.aclweb.org/anthology/D13-1170)Cited by:[Table 6](https://arxiv.org/html/2606.19079#A1.T6.1.43.41.1),[§4](https://arxiv.org/html/2606.19079#S4.p2.1)\.
- Q\. Team \(2024\)Qwen2\.5: a party of foundation models\.External Links:[Link](https://qwenlm.github.io/blog/qwen2.5/)Cited by:[§A\.2](https://arxiv.org/html/2606.19079#A1.SS2.p1.1),[§4](https://arxiv.org/html/2606.19079#S4.p1.3)\.
- A\. Wang, A\. Singh, J\. Michael, F\. Hill, O\. Levy, and S\. Bowman \(2018\)GLUE: a multi\-task benchmark and analysis platform for natural language understanding\.InProceedings of the 2018 EMNLP workshop BlackboxNLP: Analyzing and interpreting neural networks for NLP,pp\. 353–355\.Cited by:[§A\.5](https://arxiv.org/html/2606.19079#A1.SS5.p1.1),[Table 6](https://arxiv.org/html/2606.19079#A1.T6.1.10.8.1),[Table 6](https://arxiv.org/html/2606.19079#A1.T6.1.11.9.1),[Table 6](https://arxiv.org/html/2606.19079#A1.T6.1.21.19.1),[Table 6](https://arxiv.org/html/2606.19079#A1.T6.1.47.45.1),[Table 6](https://arxiv.org/html/2606.19079#A1.T6.1.7.5.1),[Table 6](https://arxiv.org/html/2606.19079#A1.T6.1.8.6.1),[§4](https://arxiv.org/html/2606.19079#S4.p2.1)\.
- L\. Wang, N\. Yang, X\. Huang, B\. Jiao, L\. Yang, D\. Jiang, R\. Majumder, and F\. Wei \(2022\)Text embeddings by weakly\-supervised contrastive pre\-training\.arXiv preprint arXiv:2212\.03533\.Cited by:[Table 4](https://arxiv.org/html/2606.19079#A1.T4),[§4](https://arxiv.org/html/2606.19079#S4.p1.3)\.
- L\. Wang, N\. Yang, X\. Huang, L\. Yang, R\. Majumder, and F\. Wei \(2024\)Multilingual e5 text embeddings: a technical report\.arXiv preprint arXiv:2402\.05672\.Cited by:[Table 4](https://arxiv.org/html/2606.19079#A1.T4)\.
- A\. Williams, N\. Nangia, and S\. Bowman \(2018\)A broad\-coverage challenge corpus for sentence understanding through inference\.InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long Papers\),External Links:[Link](http://aclweb.org/anthology/N18-1101)Cited by:[Table 6](https://arxiv.org/html/2606.19079#A1.T6.1.18.16.1),[Table 6](https://arxiv.org/html/2606.19079#A1.T6.1.19.17.1),[§4](https://arxiv.org/html/2606.19079#S4.p2.1)\.
- R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\)HellaSwag: can a machine really finish your sentence?\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,Cited by:[Table 6](https://arxiv.org/html/2606.19079#A1.T6.1.32.30.1),[§4](https://arxiv.org/html/2606.19079#S4.p2.1)\.
- Q\. Zhang, M\. Chen, A\. Bukharin, N\. Karampatziakis, P\. He, Y\. Cheng, W\. Chen, and T\. Zhao \(2023\)Adalora: adaptive budget allocation for parameter\-efficient fine\-tuning\.arXiv preprint arXiv:2303\.10512\.Cited by:[§2](https://arxiv.org/html/2606.19079#S2.p1.5)\.
- S\. Zhang, X\. Liu, J\. Liu, J\. Gao, K\. Duh, and B\. Van Durme \(2018\)Record: bridging the gap between human and machine commonsense reading comprehension\.arXiv preprint arXiv:1810\.12885\.Cited by:[Table 6](https://arxiv.org/html/2606.19079#A1.T6.1.42.40.1)\.
- X\. Zhang, J\. Zhao, and Y\. LeCun \(2015\)Character\-level convolutional networks for text classification\.Advances in neural information processing systems28\.Cited by:[Table 6](https://arxiv.org/html/2606.19079#A1.T6.1.45.43.1)\.
- Y\. Zhang, M\. Li, D\. Long, X\. Zhang, H\. Lin, B\. Yang, P\. Xie, A\. Yang, D\. Liu, J\. Lin, F\. Huang, and J\. Zhou \(2025\)Qwen3 embedding: advancing text embedding and reranking through foundation models\.arXiv preprint arXiv:2506\.05176\.Cited by:[Table 4](https://arxiv.org/html/2606.19079#A1.T4)\.
- Y\. Zhang, J\. Baldridge, and L\. He \(2019\)PAWS: Paraphrase Adversaries from Word Scrambling\.InProc\. of NAACL,Cited by:[Table 6](https://arxiv.org/html/2606.19079#A1.T6.1.5.3.1)\.
- Z\. Zhao, L\. Gan, G\. Wang, W\. Zhou, H\. Yang, K\. Kuang, and F\. Wu \(2024\)Loraretriever: input\-aware lora retrieval and composition for mixed tasks in the wild\.InFindings of the Association for Computational Linguistics: ACL 2024,pp\. 4447–4462\.Cited by:[§1](https://arxiv.org/html/2606.19079#S1.p2.1),[§2](https://arxiv.org/html/2606.19079#S2.p2.1)\.

## Appendix A Appendix

### A\.1 Full Results for Adapters: Oracle and Dynamic Selection

In Table [2](https://arxiv.org/html/2606.19079#A1.T2), we report, for each task, the performance of Llama 3\.2 1B Instruct \(Base\), the Oracle \(the base model with the adapter trained on that specific task\), and the base model with the adapter dynamically selected by ARIADNE\. For all tasks, performance is computed as Exact Match, with the exception of the CommonGen task, which is measured with ROUGE\.

Table 2: Full table for the TP comparison across the 23 selected tasks for Llama 3\.2 1B Instruct\. For all tasks, the metric is Exact Match, with the exception of the CommonGen task, which is measured with ROUGE\. This table expands the results presented in Table [1](https://arxiv.org/html/2606.19079#S4.T1.fig1)\.

### A\.2 Performances on other backbones

Even though ARIADNE is completely decoupled from the backbone and the adapter architecture, for experimental completeness we report the full pipeline performance for Qwen2\.5 3B Instruct Team \([2024](https://arxiv.org/html/2606.19079#bib.bib36)\) in Table [3](https://arxiv.org/html/2606.19079#A1.T3)\.

Table 3: Full table for the TP comparison across the 23 selected tasks for the Qwen2\.5 3B model\. For all tasks, the metric is Exact Match, with the exception of the CommonGen task, which is measured with ROUGE\.

### A\.3 Adapters Training Details

The adapters were instantiated with a bottleneck rank of $r=64$ and a scaling factor $\alpha=128$\. Training was performed for 3 epochs with a batch size of 4 and a learning rate of $5\times 10^{-5}$, while all backbone parameters remained frozen\. For the CommonGen task, ROUGE is computed using the Hugging Face `evaluate` library \(version 0\.4\.6\)\.

### A\.4 Analysis of Routing Failures

Despite high overall precision, we identify three failure modes that inform the system’s limitations: \(1\) Domain overlap: SQuAD V1 achieves 0% selection accuracy because it is consistently misrouted to the SQuAD V2 adapter\. However, the system still recovers 85% of Oracle performance \(0\.64 vs\. 0\.75 TP\), demonstrating that semantically proximate adapters can often absorb routing errors\. \(2\) Adversarial variance: ANLI R1 and R2 exhibit lower selection accuracy \(0\.46 and 0\.38\) due to high intra\-task embedding variance from adversarial construction\. Interestingly, ANLI R1’s downstream TP actually exceeds the target \(0\.42 vs\. 0\.38\), suggesting beneficial cross\-task generalization from other NLI adapters\. \(3\) Reasoning ambiguity: Complex tasks like MultiRC are frequently misrouted to general QA centroids \(TriviaQA\) due to shared linguistic surface features\. Across all modes, errors are concentrated within the same semantic cluster, leading to "graceful degradation" rather than catastrophic failure\. Figure [3](https://arxiv.org/html/2606.19079#A1.F3) visualizes this: similar tasks often have the highest centroid similarity, meaning that their representations are closer to each other than to those of other tasks\.

![Refer to caption](https://arxiv.org/html/2606.19079v1/x3.png)Figure 3: Pairwise distances between the centroids of different tasks\. Visual interpretation of graceful degradation\.

### A\.5 Embedders search

Because a large number of text embedders exist, we base our selection on a search over a subset of tasks\. We pick as the most suitable encoder the one that yields the highest cosine similarity between the generated task representations, i\.e\., the centroids, and 20 test\-set samples\. We perform this study on the GLUE tasks Wang et al\. \([2018](https://arxiv.org/html/2606.19079#bib.bib34)\)\. The results determining the selection are reported in Table [4](https://arxiv.org/html/2606.19079#A1.T4)\.

Table 4: Similarity scores between GLUE task centroids and test samples\. Bold = best, justifying our embedder choice\. † all\-MiniLM\-L12\-v2; ‡ Qwen/Qwen3\-Embedding\-0\.6B Zhang et al\. \([2025](https://arxiv.org/html/2606.19079#bib.bib67)\); § intfloat/multilingual\-e5\-small Wang et al\. \([2024](https://arxiv.org/html/2606.19079#bib.bib68)\); ¶ intfloat/e5\-large\-v2 Wang et al\. \([2022](https://arxiv.org/html/2606.19079#bib.bib30)\)\.
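
The embedder search described in A\.5 can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes one mean centroid per task and averages the cosine similarity over the held\-out samples, and the helper names \(`embedder_score`, `pick_embedder`\) are hypothetical. How the paper aggregates similarities across centroids and tasks is not stated in this appendix.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def embedder_score(train_embeddings, test_embeddings):
    """Assumed criterion: mean cosine similarity between the task centroid
    (built from training embeddings) and held-out test embeddings."""
    centroid = mean_vector(train_embeddings)
    total = sum(cosine(centroid, sample) for sample in test_embeddings)
    return total / len(test_embeddings)


def pick_embedder(score_by_embedder):
    """Return the name of the embedder with the highest score."""
    return max(score_by_embedder, key=score_by_embedder.get)
```

In practice, `embedder_score` would be computed per candidate embedder on each GLUE task with 20 test samples, and the per\-task scores combined before calling `pick_embedder`.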
### A\.6 Number of centroids ablation study

To choose the number of centroids, we compared the selection accuracy of each method on a subset of 19 tasks\. The multi\-centroid formulation is motivated by the observation that many NLP tasks exhibit substantial intra\-task embedding variance\. Adversarially constructed tasks such as ANLI produce inputs whose embeddings span multiple disjoint regions of the latent space, as illustrated in Figure [4](https://arxiv.org/html/2606.19079#A1.F4), while tasks with heterogeneous input formats, such as extractive vs\. abstractive question answering, may cluster around semantically distinct centroids even within the same task\. A single global mean collapses this structure and produces a centroid that may lie in a low\-density region, making it a poor representative of any individual input\. By partitioning each task’s distribution into local centroids, ARIADNE captures this multimodality explicitly\. K\-NN routing, by contrast, lacks task\-level structure entirely and conflates inter\-task proximity with intra\-task variance, explaining its intermediate performance in Table [5](https://arxiv.org/html/2606.19079#A1.T5)\.

![Refer to caption](https://arxiv.org/html/2606.19079v1/x4.png)Figure 4: t\-SNE analysis of the task embeddings\. Note that this visualization includes tasks from the full 44\-task pool used in the scalability study\.

Table 5: Selection accuracy over 19 tasks, used to empirically choose the best setup for dynamic selection\. The sweet spot lies between using a single centroid per task and using no centroids at all, as in the K\-NN selection strategy\.
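
To make the argument of A\.6 concrete, the sketch below routes a query to the adapter whose closest centroid is most similar to it, and contrasts a single mean centroid with two local centroids on a toy bimodal task\. It assumes cosine similarity as the proximity measure; the task names and 2\-D vectors are made up for illustration, and K\-NN routing is not shown\.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def route(query, centroids_by_adapter):
    """Return the adapter whose best-matching centroid is closest to query."""
    best_adapter, best_similarity = None, -math.inf
    for adapter, centroids in centroids_by_adapter.items():
        similarity = max(cosine(query, centroid) for centroid in centroids)
        if similarity > best_similarity:
            best_adapter, best_similarity = adapter, similarity
    return best_adapter


# Toy setup: task_a has two modes, near [1, 0] and near [0, 1].
query = [1.0, 0.05]  # belongs to the first mode of task_a

# One global mean for task_a lands between its modes, far from both.
single_centroid = {"task_a": [[0.5, 0.5]], "task_b": [[1.0, 0.5]]}
# Local centroids keep each mode of task_a represented.
multi_centroid = {"task_a": [[1.0, 0.0], [0.0, 1.0]], "task_b": [[1.0, 0.5]]}

single_choice = route(query, single_centroid)  # misroutes to "task_b"
multi_choice = route(query, multi_centroid)    # selects "task_a"
```

With the single mean, the query is more similar to `task_b`'s centroid than to the collapsed `task_a` centroid; with local centroids the correct adapter wins.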
### A\.7 Dynamic Selection Inference Cost

Since our method requires choosing an adapter at inference time, it is important to quantify the associated overhead\. The embedding step, which uses the intfloat/e5\-large\-v2 model, takes on average 20\.04 ms with a standard deviation of ±3\.70 ms\. The subsequent selection of the most appropriate adapter, based on the pre\-computed centroids, adds another 1\.98 ms \(±0\.10 ms\)\. Consequently, the complete selection pipeline incurs a total latency of roughly 22\.02 ms \(±3\.66 ms\)\. After this brief selection phase, inference proceeds as usual by applying the chosen adapter to the base model\.
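
A per\-stage latency measurement of this kind could be collected along the following lines\. This is only a sketch of the measurement pattern: `fake_embed` and `fake_select` are hypothetical stand\-ins for the real embedder call and the centroid lookup, and the repeat count is an arbitrary choice, not the paper's protocol\.

```python
import statistics
import time


def time_stage_ms(stage, payload, repeats=50):
    """Call stage(payload) `repeats` times; return (mean, std) latency in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        stage(payload)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples), statistics.stdev(samples)


# Hypothetical stand-ins; swap in the real embedding model and the
# centroid-based selection when measuring an actual deployment.
def fake_embed(text):
    return [float(len(token)) for token in text.split()]


def fake_select(embedding):
    return max(range(len(embedding)), key=embedding.__getitem__)


embed_mean, embed_std = time_stage_ms(fake_embed, "route this query please")
select_mean, select_std = time_stage_ms(fake_select, [0.1, 0.7, 0.2])
pipeline_mean = embed_mean + select_mean  # mean latencies add across stages
```

Mean latencies of sequential stages add, which matches the reported 20\.04 ms \+ 1\.98 ms = 22\.02 ms; the standard deviations do not add in the same way, so the pipeline spread is best measured end to end\.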

### A\.8 Full Routing results

To evaluate whether ARIADNE’s routing precision degrades as the adapter library grows, we extend the evaluation to the full set of 44 tasks, adding 21 tasks to the 23 of the primary evaluation\. Table [6](https://arxiv.org/html/2606.19079#A1.T6) reports per\-task selection accuracy across all 44 tasks\.

Table 6: Adapter selection accuracy across all 44 tasks\. The upper block corresponds to the 23 primary evaluation tasks; the lower block constitutes the scalability extension\. The average is computed over all 44 tasks\.
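
For reference, per\-task selection accuracy and its average over tasks can be computed as in the sketch below\. It assumes that selection accuracy is the fraction of a task's inputs routed to that task's own adapter and that the reported average is an unweighted mean over tasks; the function name and the toy labels are illustrative only\.

```python
from collections import defaultdict


def selection_accuracy(true_tasks, routed_adapters):
    """Return (per-task accuracy dict, unweighted mean over tasks)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for true_task, routed in zip(true_tasks, routed_adapters):
        totals[true_task] += 1
        hits[true_task] += int(routed == true_task)
    per_task = {task: hits[task] / totals[task] for task in totals}
    average = sum(per_task.values()) / len(per_task)
    return per_task, average


# Toy example mirroring the SQuAD V1 -> SQuAD V2 misrouting pattern.
true_tasks = ["squad_v1", "squad_v1", "sst2", "sst2"]
routed = ["squad_v2", "squad_v2", "sst2", "sst2"]
per_task, average = selection_accuracy(true_tasks, routed)
```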
### A\.9 Training Samples Robustness

We show that drastically reducing the number of training samples used to generate the centroids has only a mild effect on the selection accuracy \(SA\), as shown in Figure [5](https://arxiv.org/html/2606.19079#A1.F5)\. When using only 2% of the original number of training samples, the SA is 77\.1%\. This shows that ARIADNE performs well even when only a small number of samples is available\.

![Refer to caption](https://arxiv.org/html/2606.19079v1/x5.png)Figure 5: SA trend with fewer training samples\. The best performance is achieved with 500 samples, as reported in the main paper\. Here, we show that even with a small fraction \(2%\) of the training samples we can still achieve good SA\.

### A\.10 Selection Strategies

When generating the 5 centroids for each task, we employ 3 strategies: *beginning*, *end*, and *random*\. They respectively pick the $n$ samples from the beginning of the training set, from its end, or at random\. In our main study, we used $n=500$, with one centroid for *beginning*, one centroid for *end*, and 3 *random* samplings\. When the number of training samples for a task is smaller than $n$, we use only one centroid\.
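
The three strategies can be sketched as below\. This is a hedged illustration: it assumes each centroid is the mean of the selected embeddings, that the single fallback centroid is the mean of all available samples, and that random subsets are drawn without replacement; `build_centroids` and its `seed` parameter are hypothetical\.

```python
import random


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def build_centroids(embeddings, n=500, num_random=3, seed=0):
    """Build task centroids from training embeddings (in dataset order)."""
    if len(embeddings) < n:
        # Too few samples: fall back to a single centroid (assumed: global mean).
        return [mean_vector(embeddings)]
    rng = random.Random(seed)
    centroids = [
        mean_vector(embeddings[:n]),   # "beginning" strategy
        mean_vector(embeddings[-n:]),  # "end" strategy
    ]
    for _ in range(num_random):        # "random" strategy, repeated
        centroids.append(mean_vector(rng.sample(embeddings, n)))
    return centroids


# Toy usage with 10 two-dimensional embeddings and n=4.
toy_embeddings = [[float(i), 1.0] for i in range(10)]
toy_centroids = build_centroids(toy_embeddings, n=4)
```

With the defaults \(`n=500`, `num_random=3`\) this yields the 5 centroids per task described above, or a single centroid for tasks with fewer than `n` training samples\.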
