Vector Linking via Cross-Model Local Isometric Consistency

arXiv cs.AI Papers

Summary

This paper introduces Vector Linking, a method for recovering correspondences between embeddings from different black-box encoders by leveraging local geometric consistency, proposing an iterative reference-based geometric embedding hashing approach using a small seed set of paired anchors.

arXiv:2605.31100v1 Announce Type: new Abstract: We study Vector Linking: given two embedding clouds produced by different black-box encoders over partially overlapping datasets, recover cross-model object correspondences using only vectors. Empirically and theoretically, we show that independently trained contrastive encoders exhibit local geometric consistency: short-range distances are approximately preserved up to a scale factor, while long-range distances are not due to model-specific distortion. Building on this, we propose an iterative, reference-based geometric embedding hashing that recovers vector links from a tiny seed set of paired anchors. It represents each vector by distances to sampled paired anchors, proposes candidate links via hash-space matching, and aggregates evidence across views in a Beta-Bernoulli posterior to bootstrap high-confidence links as new anchors. Experiments across multiple benchmarks and embedding model pairs demonstrate accurate and robust linking under varying overlap, seed budgets, and out-of-domain anchors, with applications to vector database integration and cross-model clustering. Code is available at https://github.com/DBgroup-Edinburgh/VecLinking.

Cached at: 06/01/26, 09:26 AM

# Vector Linking via Cross-Model Local Isometric Consistency
Source: [https://arxiv.org/html/2605.31100](https://arxiv.org/html/2605.31100)
###### Abstract

We study Vector Linking: given two embedding clouds produced by different black-box encoders over partially overlapping datasets, recover cross-model object correspondences using only vectors. Empirically and theoretically, we show that independently trained contrastive encoders exhibit local geometric consistency: short-range distances are approximately preserved up to a scale factor, while long-range distances are not due to model-specific distortion. Building on this, we propose an iterative, reference-based geometric embedding hashing that recovers vector links from a tiny seed set of paired anchors. It represents each vector by distances to sampled paired anchors, proposes candidate links via hash-space matching, and aggregates evidence across views in a Beta-Bernoulli posterior to bootstrap high-confidence links as new anchors. Experiments across multiple benchmarks and embedding model pairs demonstrate accurate and robust linking under varying overlap, seed budgets, and out-of-domain anchors, with applications to vector database integration and cross-model clustering. Code is available at [https://github.com/DBgroup-Edinburgh/VecLinking](https://github.com/DBgroup-Edinburgh/VecLinking).

## 1 Introduction

Information systems increasingly rely on embedding\-based retrieval: large collections of objects are mapped to vectors and indexed for similarity search\. In practice, however, embedding models evolve quickly, and different systems often adopt different fine\-tuned encoders\. As a result, practitioners are left with multiple vector indices whose representations are not directly comparable, even when those indices contain many of the same objects\. This interoperability gap hinders unified retrieval, cross\-index deduplication, joint clustering, and vector database integration\.

Vector Linking. We study *vector linking*: recovering which vectors in two embedding clouds correspond to the same underlying object when the clouds are produced by different black-box contrastive encoders and overlap only partially. Formally, let $\mathcal{O}_1$ and $\mathcal{O}_2$ be two datasets of objects with *unknown* overlap $\Omega = \mathcal{O}_1 \cap \mathcal{O}_2$. Let $f_1$ and $f_2$ be two encoders, and let $\mathbf{E}_1 = f_1(\mathcal{O}_1)$ and $\mathbf{E}_2 = f_2(\mathcal{O}_2)$ be the resulting embedding sets. We assume access only to $\mathbf{E}_1$, $\mathbf{E}_2$, and a small seed set of paired anchors $S \subseteq M^*$, where $M^* = \{(f_1(x), f_2(x)) : x \in \Omega\} \subseteq \mathbf{E}_1 \times \mathbf{E}_2$. The goal is to recover as many pairs in $M^*$ as possible without access to the raw objects, model parameters, gradients, or retraining.

This setting differs from standard embedding alignment in two important ways\. First, the overlap is*partial and unknown*: there is no global bijection between the two embedding sets, and the non\-overlapping regions do not simply behave like outliers\. Instead, they can substantially alter the global geometry seen by each encoder\. Second, we target a strict*post\-hoc black\-box*regime\. Many compatibility and alignment methods assume access to training data, encoder internals, or training\-time intervention; here, only static vectors are available\. Together, these two properties make a single global transformation unreliable\.

Local isometric consistency\. Our starting point is a simple but robust finding\. When we compare pairwise distances between shared objects across independently trained contrastive encoders, short distances remain strongly correlated while long\-range distances decorrelate quickly\. Equivalently, small neighborhoods are far more stable across models than the global arrangement of the embedding clouds\. Theoretically, we show that this pattern is not merely an empirical coincidence: by analyzing a localized alignment\-uniformity surrogate for contrastive learning, under standard assumptions, we show that independently trained contrastive encoders can induce locally isometric metrics up to scale\.

Geometric embedding hashing\. Motivated by this observation, we propose*Geometric Embedding Hashing*\(GEH\)\. The basic unit of GEH is a*distance\-to\-anchor signature*: given a small set of paired anchors, each vector is represented by its distances to those anchors within its own embedding space\. If two vectors correspond to the same object, and if the chosen anchors lie in their local neighborhoods, then these relative distance patterns should remain similar up to scale even when the global shapes of the two embedding clouds differ markedly\. GEH therefore compares normalized, scale\-free signatures rather than raw distances\.

A single anchor set is not locally informative for every point, so GEH does not rely on one global hash\. Instead, it repeatedly samples many small anchor subsets, or*views*, matches points independently in each induced hash space, and treats resulting matches as noisy votes\. A Beta\-Bernoulli posterior aggregates evidence across views, and high\-confidence matches are promoted as new anchors for the next round\. This multi\-view bootstrapping lets GEH grow a tiny seed set into a large correspondence set while gracefully filtering spurious collisions caused by model\-specific distortion and partial overlap\.

We evaluate GEH on multiple BEIR benchmarks and five encoder pairs spanning both API-based and open-weight models. Across varying overlap ratios, seed budgets, and out-of-domain seed settings, GEH consistently outperforms eight linear, nonlinear, and optimal-transport baselines using only 15 to 30 seed pairs. For instance, with only 15 paired seeds, GEH achieves over 90% recall on FiQA (Maia et al., [2018](https://arxiv.org/html/2605.31100#bib.bib25)) for Mistral and OpenAI. We further show that the recovered links improve downstream tasks including vector database integration and cross-model clustering.

The results suggest that vector linking is a practical primitive for embedding interoperability, and that local geometric consistency across contrastive encoders holds the key to tackling it.

Contributions & organization. We contribute as follows.

- We propose vector linking, the problem of recovering correspondences between two black-box embedding clouds under partial, unknown overlap.
- We establish, both empirically and theoretically, a cross-model *local distance consistency* property, forming the foundation of encoder-invariant hashing (Section [2](https://arxiv.org/html/2605.31100#S2)).
- We develop a multi-view geometric hashing algorithm with posterior-guided bootstrapping that accurately recovers vector links without accessing raw objects or model internals, using only tiny seeds (Sections [3](https://arxiv.org/html/2605.31100#S3)-[4](https://arxiv.org/html/2605.31100#S4)).
- We demonstrate accurate and robust linking across multiple benchmarks and embedding model pairs (Section [5](https://arxiv.org/html/2605.31100#S5)).
- We further demonstrate its benefits to vector database integration and cross-model clustering (Section [6](https://arxiv.org/html/2605.31100#S6)).

Related Work. Geometric point set registration under partial overlap has been studied via hypothesis testing (e.g., RANSAC (Fischler & Bolles, [1981](https://arxiv.org/html/2605.31100#bib.bib11)), TEASER++ (Yang et al., [2020](https://arxiv.org/html/2605.31100#bib.bib45))), iterative refinement (e.g., ICP (Besl & McKay, [1992](https://arxiv.org/html/2605.31100#bib.bib3)), Go-ICP (Yang et al., [2016](https://arxiv.org/html/2605.31100#bib.bib46))), and invariant signatures (geometric hashing (Lamdan & Wolfson, [1988](https://arxiv.org/html/2605.31100#bib.bib20))). These tools are primarily designed for 3D rigid space and cannot handle the high-dimensional, heteroscedastic, model-induced distortion and unknown overlap that vector linking targets.

Embedding alignment methods, *e.g.,* for bilingual lexicon induction, learn a global mapping between spaces (linear/Procrustes, OT/GW) (Mikolov et al., [2013](https://arxiv.org/html/2605.31100#bib.bib26); Xing et al., [2015](https://arxiv.org/html/2605.31100#bib.bib43); Smith et al., [2017](https://arxiv.org/html/2605.31100#bib.bib34); Lample et al., [2018](https://arxiv.org/html/2605.31100#bib.bib21); Artetxe et al., [2018](https://arxiv.org/html/2605.31100#bib.bib2); Alvarez-Melis & Jaakkola, [2018](https://arxiv.org/html/2605.31100#bib.bib1); Grave et al., [2019](https://arxiv.org/html/2605.31100#bib.bib15)), often relying on an approximate global isomorphism. Such global consistency has also been exploited for domain adaptation (Shen et al., [2021](https://arxiv.org/html/2605.31100#bib.bib33); Hu et al., [2022](https://arxiv.org/html/2605.31100#bib.bib17); Wang & Mahadevan, [2011](https://arxiv.org/html/2605.31100#bib.bib40); Wang et al., [2018](https://arxiv.org/html/2605.31100#bib.bib41); Ganin et al., [2016](https://arxiv.org/html/2605.31100#bib.bib12); Hoffman et al., [2017](https://arxiv.org/html/2605.31100#bib.bib16)), which further demands training-time access not available in the black-box vector linking setting.

Unlike embedding alignment, which seeks a coupling that makes the entire spaces globally comparable, vector linking seeks a partial one-to-one correspondence relation on the unknown shared support, while leaving vectors outside the overlap unmatched. This creates an objective mismatch for global alignment, as there is no global bijection to recover. Further, non-overlapping regions are structured and potentially large, so they do not behave like removable outliers. Alignment can thus improve global fit on unmatched regions while worsening correspondences on the overlap.

Vector linking bootstraps downstream interoperability tasks such as cross-model vector database integration (Yang et al., [2025](https://arxiv.org/html/2605.31100#bib.bib44)) and joint clustering (Enevoldsen et al., [2025](https://arxiv.org/html/2605.31100#bib.bib10)), which assume that reliable cross-model anchor pairs are already known. Vector linking addresses this assumption by recovering correspondences from black-box vector clouds.

Conflict of Interest Disclosure\. The authors declare no financial conflicts of interest\. All authors are affiliated solely with academic institutions and the embedding models evaluated in this work are independent third\-party systems\.

## 2 Foundation of Embedding Hashing

This section provides the geometric foundation behind the idea of encoder-invariant geometric hashing. We first establish an empirical short-to-long range transition in cross-model distance consistency (Section [2.1](https://arxiv.org/html/2605.31100#S2.SS1)). We then provide a localized geometric explanation for why contrastive encoders tend to preserve local geometry (Section [2.2](https://arxiv.org/html/2605.31100#S2.SS2)).

### 2.1 Emergence of Local Distance Consistency

We begin by quantifying how pairwise Euclidean distances compare across embedding spaces. Let $\mathbf{E}_1$ and $\mathbf{E}_2$ be embeddings of the same raw dataset $D$ produced by two different encoders (*e.g.,* Mistral vs. OpenAI). We sample pairs $(u,v)$ of objects from $D$, compute $d_{\mathbf{E}_1}(u,v)$ and $d_{\mathbf{E}_2}(u,v)$, and bin pairs by $d_{\mathbf{E}_1}(u,v)$. We report, per bin, the Pearson correlation between $d_{\mathbf{E}_1}(u,v)$ and $d_{\mathbf{E}_2}(u,v)$ in Fig. [1](https://arxiv.org/html/2605.31100#S2.F1) across multiple BEIR benchmarks (see [C.1](https://arxiv.org/html/2605.31100#A3.SS1) for more).
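The binned-correlation diagnostic described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `binned_distance_correlation`, the equal-width binning, and the default sample sizes are all assumptions, and the inputs are assumed to be row-aligned NumPy arrays (row `i` of each array embeds the same object).

```python
import numpy as np

def binned_distance_correlation(E1, E2, n_pairs=20000, n_bins=10, seed=0):
    """Per-bin Pearson correlation between pairwise distances in E1 and E2.

    E1, E2: arrays of shape (n, K1) and (n, K2); row i embeds the same object.
    Pairs are binned by their distance in E1 (the reference space).
    Equal-width bins are an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    n = E1.shape[0]
    u = rng.integers(0, n, size=n_pairs)
    v = rng.integers(0, n, size=n_pairs)
    keep = u != v                       # drop degenerate pairs (u == v)
    u, v = u[keep], v[keep]
    d1 = np.linalg.norm(E1[u] - E1[v], axis=1)
    d2 = np.linalg.norm(E2[u] - E2[v], axis=1)
    edges = np.linspace(d1.min(), d1.max(), n_bins + 1)
    # Interior edges give bin indices in 0..n_bins-1.
    bin_idx = np.digitize(d1, edges[1:-1])
    rho = np.full(n_bins, np.nan)       # NaN marks bins with too little data
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.sum() >= 3 and d1[mask].std() > 0 and d2[mask].std() > 0:
            rho[b] = np.corrcoef(d1[mask], d2[mask])[0, 1]
    return edges, rho
```

Plotting `rho` against the bin centers derived from `edges` gives a curve of the kind shown in Fig. 1; quantile bins could be used instead when the distance distribution is heavily skewed.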

![Refer to caption](https://arxiv.org/html/2605.31100v1/x1.png)

Figure 1: Consistency (linear correlation) vs. vector distances: The x-axis shows the pairwise distance in the reference space (Mistral), while the y-axis reports the Pearson correlation ($\rho$) of these distances with their counterparts in the target space (OpenAI).

Local consistency. For short distances, *e.g.,* $d_{\mathbf{E}_1}(u,v) \lesssim 0.57$ for ArguAna, the correlation is substantially positive with $\rho$ above 0.8. Further, as shown in Appendix [A.5](https://arxiv.org/html/2605.31100#A1.SS5), we also find that top-$k$ retrieval exhibits strong consistency with small $k$ (*e.g.,* $k < 10$) across embeddings compared to large $k$, which indicates that nearby pairs under $\mathbf{E}_1$ tend to remain nearby under $\mathbf{E}_2$. This suggests that linear correlation in the short-distance regime is substantially more statistically significant than at long distances.

Global Decorrelation ($\rho \approx 0$). As distances increase, the correlation decays rapidly, consistently nearing zero. This collapse indicates that the scaling factor $\alpha$ is not globally constant. While the models agree on the “shape” of local neighborhoods, they diverge significantly on the global arrangement and distribution of data objects. This renders long-range distances inconsistent across embedding spaces.

This suggests that short distances are consistent across encoders, and thus a distance-to-anchor vector that consists of short distances can be a viable choice for encoder-invariant geometric hashing. We have also observed that such correlation is significantly weaker for non-contrastive encoders (see Appendix [A.4](https://arxiv.org/html/2605.31100#A1.SS4) for more). It remains to check whether this is merely an empirical coincidence or something fundamental to the embedding models.

### 2.2 A Geometric Justification of Local Isometries

We give a geometric explanation for the short-distance consistency of Fig. [1](https://arxiv.org/html/2605.31100#S2.F1). We show that the phenomenon is inherent to contrastive encoders rather than an empirical coincidence.

Geometric modeling. We model data as a random variable $X$ supported on a smooth $d$-dimensional manifold $\mathcal{M} \subset \mathbb{R}^N$ with density $p(x)$ *w.r.t.* the intrinsic Riemannian volume measure on $\mathcal{M}$. We denote geodesic distance by $d_{\mathcal{M}}(x,y)$, which serves as an intrinsic notion of semantic dissimilarity between the data objects modeled by $x$ and $y$.

An embedding model (*a.k.a.* encoder) is a map $f: \mathcal{M} \to \mathbb{R}^K$ with normalized outputs $f(\mathcal{M}) \subset S^{K-1}$. Let the Jacobian of the encoder $f$ at $x$ be $J_f(x) \in \mathbb{R}^{K \times d}$, where $d$ is the intrinsic dimension of $\mathcal{M}$; it maps tangent vectors from the data manifold to the embedding space. We denote the metric tensor induced by the encoder $f$ as $G_f(x) := J_f(x)^\top J_f(x) \in \mathbb{R}^{d \times d}$, which characterizes how local distances are distorted by the map. (A1) We assume that $f$ is twice differentiable and injective, and that $G_f(x)$ is positive definite for all $x \in \mathcal{M}$, *i.e.,* $J_f(x)$ has full rank $d$.

Short-range neighborhoods. For each $x \in \mathcal{M}$, let $\delta_{\mathcal{M}}(x) > 0$ be such that whenever $d_{\mathcal{M}}(x,y) < \delta_{\mathcal{M}}(x)$ the shortest geodesic from $x$ to $y$ on $\mathcal{M}$ is unique. For such $y$, define the geodesic displacement $v(x,y) \in T_x\mathcal{M}$ as the tangent vector at $x$ pointing toward $y$ along this unique shortest geodesic, normalized so that $\|v(x,y)\| = d_{\mathcal{M}}(x,y)$.

Contrastive learning. We consider encoders trained via contrastive learning with InfoNCE-type contrastive loss objectives. The training signal comes from: (i) positive pairs, which are two semantic-preserving “views” of the same data object, and (ii) negative pairs, which pair unrelated data objects. Given $x \in \mathcal{M}$, a “positive view” $x^+$ is sampled by a stochastic augmentation. We assume the following. (A2: local positives) $d_{\mathcal{M}}(x, x^+) < \delta_{\mathcal{M}}(x)$ almost surely. (A3: local isotropy) Following (Dao et al., [2019](https://arxiv.org/html/2605.31100#bib.bib9); Wang & Isola, [2020](https://arxiv.org/html/2605.31100#bib.bib42)), we assume that the distribution of $x^+$ is centered and isotropic on the local tangent space. Specifically, fix $x \in \mathcal{M}$ and let the geodesic displacement vector be $v = v(x, x^+) \in T_x\mathcal{M}$. We assume $\mathbb{E}[v \mid x] = 0$ and $\mathbb{E}[vv^\top \mid x] = c\,I_d$. (We discuss relaxations to $c = c(x)$ in Appendix [A.3](https://arxiv.org/html/2605.31100#A1.SS3).)

Global contrastive surrogate. We adopt the standard alignment-uniformity perspective on contrastive learning (Wang & Isola, [2020](https://arxiv.org/html/2605.31100#bib.bib42); Zimmermann et al., [2021](https://arxiv.org/html/2605.31100#bib.bib49)): positives should be mapped close (alignment), while the overall representation distribution should be spread out to avoid collapse (uniformity). Let $X$ be the random data point on $\mathcal{M}$ and let $X^+$ denote its positive view. Write $Z := f(X)$ for the induced random representation.

Following (van den Oord et al., [2018](https://arxiv.org/html/2605.31100#bib.bib37); Poole et al., [2019](https://arxiv.org/html/2605.31100#bib.bib31)), for InfoNCE-like contrastive losses, alignment loss minimizes $\mathbb{E}\big[\|f(x) - f(x^+)\|^2\big]$, and uniformity can be modeled by maximizing the entropy of the representation, which we interpret intrinsically on $f(\mathcal{M})$ as the standard differential entropy (Cover & Thomas, [2006](https://arxiv.org/html/2605.31100#bib.bib8)). Specifically, let $q$ denote the density of $Z$ on $f(\mathcal{M})$ *w.r.t.* the induced $d$-dimensional surface volume on $f(\mathcal{M})$, and define entropy $H(Z) := -\mathbb{E}\big[\log q(Z)\big]$. Then we have the implementation-independent surrogate of contrastive loss:

$$\mathcal{L}_{\lambda}(f) := \mathbb{E}\big[\|f(X) - f(X^+)\|^2\big] - \lambda H(Z),$$

where $\lambda > 0$ is a model-dependent coefficient that balances alignment and uniformity. This surrogate is not identical to InfoNCE, but captures its geometric pressure toward (i) local alignment of positives and (ii) spread of representations.

A localized geometric view. As we focus on local geometric properties, we develop a localized view of the global contrastive loss surrogate. Since $\mathcal{L}_{\lambda}(f)$ is an expectation over $X$, it can be written as an average of per-point contributions. Hence, we can write $\mathcal{L}_{\lambda}(f)$ equivalently as

$$\mathbb{E}\Big[\underbrace{\mathbb{E}\big[\|f(X) - f(X^+)\|^2 \mid X\big]}_{\varphi_{\text{align}}:\ \text{local alignment at } X} + \underbrace{\lambda\,\log q(f(X))}_{\varphi_{\text{uni}}:\ \text{local uniformity at } X}\Big].$$

Motivated by this, we define a localized loss at $x \in \mathcal{M}$, denoted by $\mathcal{L}_{\lambda}(x;f)$, such that $\mathcal{L}_{\lambda}(f) = \mathbb{E}[\mathcal{L}_{\lambda}(X;f)]$; hence

$$\mathcal{L}_{\lambda}(x;f) := \varphi_{\text{align}}(X=x) + \varphi_{\text{uni}}(X=x).$$

Under the encoder assumption (A1), by the area formula and change of variables (Lee, [2003](https://arxiv.org/html/2605.31100#bib.bib22)) we have $q(f(x)) = p(x)/\sqrt{\det(G_f(x))}$. Therefore, we have $\mathcal{L}_{\lambda}(x;f) = \varphi_{\text{align}}(X=x) - \frac{\lambda}{2}\log\det(G_f(x)) + \text{const}(x)$, where $\text{const}(x)$ is an $x$-dependent constant not involving $f$. Further, by the short-range data augmentation (A2, A3), $\varphi_{\text{align}}(X=x) \approx c \cdot \mathrm{tr}(G_f(x))$ (ignoring higher-order terms; see Corollary [A.1](https://arxiv.org/html/2605.31100#A1.SS1) in Appendix [A](https://arxiv.org/html/2605.31100#A1)). Thus the leading-order localized geometric objective is

$$\widetilde{\mathcal{L}}_{\lambda}(x;f) := c \cdot \mathrm{tr}(G_f(x)) - \frac{\lambda}{2}\log\det(G_f(x)).$$

Locally optimal encoders. We say that an encoder $f$ is locally optimal at $x \in \mathcal{M}$ (*w.r.t.* $\lambda$) if its $G_f(x)$ minimizes the leading-order localized geometric objective, *i.e.,* $G_f(x) \in \arg\min_{G \succ 0}\left\{c \cdot \mathrm{tr}(G) - \frac{\lambda}{2}\log\det(G)\right\}$.

Consider a manifold $\mathcal{M}$. Then we show the following (see Appendix [A](https://arxiv.org/html/2605.31100#A1) for a full proof).

Theorem 1: Let $f_1$ and $f_2$ be two encoders locally optimal at point $x \in \mathcal{M}$ with parameters $\lambda_1$ and $\lambda_2$, respectively. For any $y \in \mathcal{M}$ with $d_{\mathcal{M}}(x,y) < \delta_{\mathcal{M}}(x)$:

$$\|f_1(x) - f_1(y)\| = \kappa \cdot \|f_2(x) - f_2(y)\| + \mathcal{O}\big(d_{\mathcal{M}}(x,y)^2\big),$$

where $\kappa = \sqrt{\lambda_1/\lambda_2}$. $\Box$

Theorem [2.2](https://arxiv.org/html/2605.31100#S2.SS2) is inherently local: it holds only for $y$ within neighborhood radius $\delta_{\mathcal{M}}(x)$, consistent with the empirical decorrelation at long distances in Fig. [1](https://arxiv.org/html/2605.31100#S2.F1). Moreover, the same analysis extends to a relaxation of (A3) where the local augmentation scale is point-dependent, *i.e.,* $\mathbb{E}[vv^\top \mid x] = c(x) I_d$, yielding a region-dependent $\kappa$ (see Appendix [A.3](https://arxiv.org/html/2605.31100#A1.SS3)).
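To see where the scale factor $\kappa = \sqrt{\lambda_1/\lambda_2}$ comes from, the following is a brief sketch of the leading-order argument. It is not the full proof of Appendix A. It assumes, as our reading of the theorem statement, that both encoders share the same augmentation constant $c$ from (A3); the eigenvalue symbols $\mu_i$ are introduced here only for illustration.

```latex
% Step 1: minimize the leading-order localized objective over G \succ 0.
% With eigenvalues \mu_1, \dots, \mu_d > 0 of G, the objective decouples:
\begin{align*}
c\,\mathrm{tr}(G) - \tfrac{\lambda}{2}\log\det(G)
  &= \sum_{i=1}^{d}\Big(c\,\mu_i - \tfrac{\lambda}{2}\log\mu_i\Big).
\end{align*}
% Each summand is strictly convex in \mu_i, so its stationary point is the minimizer:
\begin{align*}
c - \frac{\lambda}{2\mu_i} = 0
  \;\Longrightarrow\; \mu_i = \frac{\lambda}{2c}
  \;\Longrightarrow\; G^{\star} = \frac{\lambda}{2c}\, I_d .
\end{align*}
% Step 2: first-order expansion along the geodesic displacement v = v(x,y),
% using that f is twice differentiable (A1) and \|v\| = d_{\mathcal{M}}(x,y):
\begin{align*}
\|f(x) - f(y)\|
  &= \sqrt{v^{\top} G_f(x)\, v} + \mathcal{O}\big(d_{\mathcal{M}}(x,y)^2\big)
   = \sqrt{\tfrac{\lambda}{2c}}\; d_{\mathcal{M}}(x,y)
     + \mathcal{O}\big(d_{\mathcal{M}}(x,y)^2\big).
\end{align*}
% Step 3: apply Step 2 to f_1 (with \lambda_1) and f_2 (with \lambda_2)
% and eliminate d_{\mathcal{M}}(x,y):
\begin{align*}
\|f_1(x) - f_1(y)\|
  &= \sqrt{\tfrac{\lambda_1}{\lambda_2}}\;\|f_2(x) - f_2(y)\|
     + \mathcal{O}\big(d_{\mathcal{M}}(x,y)^2\big).
\end{align*}
```

Under the relaxation $c = c(x)$, the same steps go through pointwise; if the two encoders were trained with different augmentation scales, the constants would no longer cancel in Step 3 and the ratio would depend on both $\lambda$ and $c$.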

## 3 Encoder-Invariant Geometric Hashing

Theorem [2.2](https://arxiv.org/html/2605.31100#S2.SS2) establishes a local cross-model geometric consistency: under local optimality, two contrastive encoders preserve short-range distances up to a scale factor. We now translate this local property into a concrete hashing framework for vector linking. The goal is to construct, for each vector, a signature that is (approximately) encoder-invariant over the overlap $\Omega$, so that matching objects collide (or become nearest neighbors) in a shared hash space.

Distance-to-anchor hash. A geometric view of $\Omega$ across $\mathbf{E}_1$ and $\mathbf{E}_2$ is a set $\mathcal{A}$ of paired vectors $\{(a_1, a_1'), \dots, (a_k, a_k')\} \subseteq \mathbf{E}_1 \times \mathbf{E}_2$, where each pair $(a_j, a_j')$ encodes the same overlap object in $\Omega$ and is referred to as a paired anchor. Fix a distance function $\mathsf{dist}(\cdot,\cdot)$. Given a view $\mathcal{A}$, we define the distance-to-anchor hash of $u \in \mathbf{E}_1$ and $v \in \mathbf{E}_2$ *w.r.t.* $\mathcal{A}$ as $\mathbf{r}_{\mathcal{A}}(u) := \big(\mathsf{dist}(u, a_1), \dots, \mathsf{dist}(u, a_k)\big) \in \mathbb{R}^k$ and $\mathbf{r}'_{\mathcal{A}}(v) := \big(\mathsf{dist}(v, a'_1), \dots, \mathsf{dist}(v, a'_k)\big) \in \mathbb{R}^k$, respectively.

As Theorem [2.2](https://arxiv.org/html/2605.31100#S2.SS2) predicts an unknown scale factor between encoders, we compare hashes $\mathbf{r}_{\mathcal{A}}(u)$ and $\mathbf{r}'_{\mathcal{A}}(v)$ using a scale-free similarity. A simple choice is cosine similarity, *i.e.,* the inner product of the $\ell_2$-normalized hashes: $\text{sim}_{\mathcal{A}}(u,v) := \langle \widehat{\mathbf{r}}_{\mathcal{A}}(u), \widehat{\mathbf{r}}'_{\mathcal{A}}(v) \rangle$, where $\widehat{\mathbf{r}}_{\mathcal{A}}(u) := \frac{\mathbf{r}_{\mathcal{A}}(u)}{\|\mathbf{r}_{\mathcal{A}}(u)\|_2}$ and $\widehat{\mathbf{r}}'_{\mathcal{A}}(v) := \frac{\mathbf{r}'_{\mathcal{A}}(v)}{\|\mathbf{r}'_{\mathcal{A}}(v)\|_2}$.
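The two definitions above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the released code: the function names `anchor_hash` and `scale_free_similarity` are hypothetical, and $\mathsf{dist}$ is assumed to be the Euclidean distance (the text leaves the distance function open).

```python
import numpy as np

def anchor_hash(X, anchors):
    """Distance-to-anchor hash r_A: row i holds the Euclidean distances
    from X[i] to each of the k anchors (Euclidean dist is an assumption)."""
    diff = X[:, None, :] - anchors[None, :, :]   # shape (n, k, K)
    return np.linalg.norm(diff, axis=2)          # shape (n, k)

def scale_free_similarity(R1, R2, eps=1e-12):
    """Cosine similarity between every row of R1 and every row of R2,
    i.e. the inner product of the L2-normalized hashes."""
    R1n = R1 / np.maximum(np.linalg.norm(R1, axis=1, keepdims=True), eps)
    R2n = R2 / np.maximum(np.linalg.norm(R2, axis=1, keepdims=True), eps)
    return R1n @ R2n.T                           # shape (n1, n2)
```

For example, with `S = scale_free_similarity(anchor_hash(E1, A1), anchor_hash(E2, A2))`, where `A1[j]` and `A2[j]` hold the $j$-th paired anchor in each space, entry `S[i, j]` scores the candidate link between `E1[i]` and `E2[j]`. Because both hashes are normalized, multiplying either embedding space by a positive constant leaves `S` unchanged.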

Locality $\Rightarrow$ encoder-invariant hashing. Let $x \in \Omega$ have representations $u = f_1(x) \in \mathbf{E}_1$ and $v = f_2(x) \in \mathbf{E}_2$. Consider a view $\mathcal{A} = \{(a_1, a'_1), \dots, (a_k, a'_k)\}$ whose anchors correspond to overlap objects $x_1, \dots, x_k \in \Omega$ (*i.e.,* $a_j = f_1(x_j)$ and $a'_j = f_2(x_j)$). If all anchors are short-range for $x$, *i.e.,* $d_{\mathcal{M}}(x, x_j) < \delta_{\mathcal{M}}(x)$ for all $j$, then applying Theorem [2.2](https://arxiv.org/html/2605.31100#S2.SS2) with $y = x_j$ yields the componentwise relation $\mathsf{dist}(u, a_j) = \kappa\,\mathsf{dist}(v, a'_j) + O(d_{\mathcal{M}}(x, x_j)^2)$. Hence

$$\mathbf{r}_{\mathcal{A}}(u) = \kappa\,\mathbf{r}'_{\mathcal{A}}(v) + \boldsymbol{\epsilon}_{\mathcal{A}}(x), \qquad \|\boldsymbol{\epsilon}_{\mathcal{A}}(x)\| = O\big(\max_j d_{\mathcal{M}}(x, x_j)^2\big).$$

In the ideal case $\boldsymbol{\epsilon}_{\mathcal{A}}(x) = 0$, the two hashes are exactly related by a positive scalar, so after $\ell_2$ normalization they coincide and our scale-free similarity satisfies $\text{sim}_{\mathcal{A}}(u, v) = 1$. When anchors are sufficiently close, the second-order remainder is small, hence the normalized hashes remain close and $\text{sim}_{\mathcal{A}}(u, v)$ stays near $1$. Therefore, distance-to-anchor hashing is approximately encoder-invariant in the short-range regime, providing a geometric basis for vector linking via hash-space nearest neighbors.

The locality requirement above is essential\. As shown empirically in Section[2\.1](https://arxiv.org/html/2605.31100#S2.SS1), long\-range distances decorrelate across encoders\. Therefore, if a view contains anchors that are far from the candidate vectors, those hash coordinates become dominated by model\-specific distortion and can overwhelm the signal from locally consistent distances\. This implies that global hashing with one fixed anchor set is unreliable for points that do not lie near that anchor set, which is precisely the typical case under partial, unknown overlap\.

Localizing hashes via multi-view voting. While distance-to-anchor hashes are encoder-invariant when they are constructed from anchors in the short-range neighborhood, it is nontrivial to decide what counts as a short-range distance, as the threshold $\delta_{\mathcal{M}}$ is unknown and data-dependent.

Rather than explicitly estimating the unknown locality threshold $\delta_{\mathcal{M}}$, we treat locality statistically by sampling many small, diverse geometric views $\mathcal{A}^{(1)}, \dots, \mathcal{A}^{(T)}$ from the current pool of paired anchors (via bootstrapping, as we will see in Section [4](https://arxiv.org/html/2605.31100#S4)). Each view induces its own hash space and proposes candidate links by nearest-neighbor matching under $\text{sim}_{\mathcal{A}^{(t)}}(\cdot,\cdot)$; the proposed pairs are treated as votes. True links are supported across many views that happen to include locally relevant anchors, whereas spurious collisions caused by long-range, model-specific distortion are view-dependent and rarely accumulate. Since matching within each view uses a scale-free similarity, this aggregation remains effective even when the proportionality factor varies over the manifold, *e.g.,* under the relaxed isotropy model of (A3), *i.e.,* $\mathbb{E}[vv^\top \mid x] = c(x) I_d$ (Appendix [A.3](https://arxiv.org/html/2605.31100#A1.SS3)).
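A minimal sketch of the voting loop described above is given below. Several details are assumptions made for illustration and are not taken from the paper: the function name `multi_view_votes`, the use of Euclidean distance, uniform sampling of views without replacement, and the rule that a pair counts as proposed only when the two vectors are mutual nearest neighbors in the view's hash space (the text says only "nearest-neighbor matching").

```python
import numpy as np
from collections import Counter

def _normalized_hash(X, anchors):
    # L2-normalized distance-to-anchor hash (Euclidean distance assumed).
    R = np.linalg.norm(X[:, None, :] - anchors[None, :, :], axis=2)
    return R / np.maximum(np.linalg.norm(R, axis=1, keepdims=True), 1e-12)

def multi_view_votes(E1, E2, pool1, pool2, n_views=50, view_size=5, seed=0):
    """Count, over random views, how often each pair (i, j) is proposed.

    pool1[k] and pool2[k] are the row indices of the k-th paired anchor
    in E1 and E2. A pair is proposed in a view when i and j are mutual
    nearest neighbors under the scale-free similarity (assumed rule).
    """
    pool1, pool2 = np.asarray(pool1), np.asarray(pool2)
    rng = np.random.default_rng(seed)
    votes = Counter()
    for _ in range(n_views):
        pick = rng.choice(len(pool1), size=view_size, replace=False)
        H1 = _normalized_hash(E1, E1[pool1[pick]])
        H2 = _normalized_hash(E2, E2[pool2[pick]])
        S = H1 @ H2.T                 # similarity in this view's hash space
        best_j = S.argmax(axis=1)     # nearest neighbor in E2 for each i
        best_i = S.argmax(axis=0)     # nearest neighbor in E1 for each j
        for i, j in enumerate(best_j):
            if best_i[j] == i:
                votes[(i, int(j))] += 1
    return votes
```

The returned counter maps each candidate pair to its accumulated vote count, which is the quantity on the x-axis of a vote histogram. The dense similarity matrix is quadratic in the number of vectors, so a practical implementation would likely replace it with an approximate nearest-neighbor index.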

![Refer to caption](https://arxiv.org/html/2605.31100v1/x2.png)

Figure 2: Signal (true links) vs. noise (spurious links): the x-axis is the accumulated votes for candidate links on ArguAna (GTE vs. OpenAI); the y-axis reports pair frequency (log scale).

We tested this voting mechanism in a one-shot diagnostic on ArguAna encoded by GTE (Li et al., [2023](https://arxiv.org/html/2605.31100#bib.bib23)) and OpenAI ([OpenAI,](https://arxiv.org/html/2605.31100#bib.bib27)). We sampled a fixed pool of 500 ground-truth paired anchors from the overlap and drew 500 random views, each containing 30 anchors. In each view we computed hashes and collected voted links via hash collisions, then tallied for each candidate pair the number of views in which it was proposed. Fig. [2](https://arxiv.org/html/2605.31100#S3.F2) shows the resulting vote histogram on a log scale. We observe a sharp separation: spurious links (red) follow a steep exponential decay, with the vast majority receiving negligible support. In contrast, true links (blue) exhibit a robust distribution with a median of 48 votes, demonstrating that stable local geometric consistency allows true links to survive across diverse views.

## 4 Bootstrapping Multi-View Hashing

Section [3](https://arxiv.org/html/2605.31100#S3) shows that distance-to-anchor hashing is reliable only when a view contains anchors that are locally relevant, and that many such views are needed to form the localized hash statistically via voting. We reduce the number of anchors required up front with an iterative bootstrapping framework that starts from a tiny seed set of paired anchors and grows it using multi-view hash collisions with posterior-guided promotion.

Framework. The framework, denoted by $\mathsf{GEH}$ (Geometric Embedding Hashing) and shown in Fig. [3](https://arxiv.org/html/2605.31100#S4.F3), takes as input embedding clouds $\mathbf{E}_1$ and $\mathbf{E}_2$ and a tiny seed set of paired anchors $\mathcal{S}\subseteq\mathbf{E}_1\times\mathbf{E}_2$, known via *e.g.,* domain knowledge, and outputs a set of inferred links $\mathcal{L}_T\subseteq\mathbf{E}_1\times\mathbf{E}_2$.

Starting with $\mathcal{L}_0:=\mathcal{S}$, $\mathsf{GEH}$ generates $\mathcal{L}_T$ through iterations. At iteration $t$, it derives $\mathcal{L}_t$ by using the inferred links in $\mathcal{L}_{t-1}$, identified at iteration $t-1$, as hash anchors, in three steps:

- (View generation) It samples $m_t$ geometric views $\mathcal{A}_{t,1},\dots,\mathcal{A}_{t,m_t}$ from $\mathcal{L}_{t-1}$, such that each view $\mathcal{A}_{t,k}\subseteq\mathcal{L}_{t-1}$ is a subset of paired anchors of fixed size $s_t$.
- (Per-view link proposals) For each view $\mathcal{A}_{t,k}$, it computes view-specific distance-to-anchor hashes and proposes a set of candidate links $\mathcal{P}_{t,k}\subseteq\mathbf{E}_1\times\mathbf{E}_2$ by nearest-neighbor matching in the hash space.
- (Posterior-guided bootstrapping) It aggregates all proposals into a confidence score for each candidate link and promotes high-confidence links as new anchors in $\mathcal{L}_t$.

![Refer to caption](https://arxiv.org/html/2605.31100v1/x3.png)

Figure 3: The geometric embedding hashing ($\mathsf{GEH}$) framework.

We next instantiate these three steps in full.

View generation. At iteration $t$, we draw $m_t$ views, $\mathcal{A}_{t,1},\dots,\mathcal{A}_{t,m_t}\subseteq\mathcal{L}_{t-1}$, each of size $s_t$. To stabilize early iterations, we include the seed set $\mathcal{S}$ in every view. This ensures views share a reliable core signal even when $\mathcal{L}_{t-1}$ is small.

View sampling. The quality of a view depends on anchor diversity: clustered anchors yield redundant hash coordinates and poor stability. Hence, we choose each view by greedy Furthest Point Sampling (FPS) (Gonzalez, [1985](https://arxiv.org/html/2605.31100#bib.bib14)) on one side of the paired anchors. Let $\mathcal{B}_{t-1}:=\mathcal{L}_{t-1}\setminus\mathcal{S}$, *i.e.,* the anchors bootstrapped at the previous iteration $t-1$. To form view $\mathcal{A}_{t,k}$, FPS starts from a random element of $\mathcal{B}_{t-1}$ and then iteratively adds the paired anchor whose (chosen-side) vector maximizes its minimum distance to the current subset. This encourages views with widely separated anchors and improves stability (see Appendix [B.1](https://arxiv.org/html/2605.31100#A2.SS1) for a detailed analysis).
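The greedy FPS step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `fps_view`, its arguments, and the choice to pass the seed anchors as an index list are assumptions made for the example, and cosine distance on L2-normalized vectors is assumed as the distance.

```python
import numpy as np

def fps_view(anchor_vecs, view_size, seed_idx=(), rng=None):
    """Greedy furthest point sampling over one side of the paired anchors.

    anchor_vecs: (n, d) array of L2-normalized vectors (the chosen side).
    seed_idx:    indices that are always part of the view (the seed set).
    Returns the list of selected anchor indices (hypothetical interface).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = anchor_vecs.shape[0]
    view_size = max(1, min(view_size, n))
    selected = list(seed_idx)
    if len(selected) < view_size:
        # Random starting element drawn from the non-seed anchors.
        chosen = set(selected)
        pool = [i for i in range(n) if i not in chosen]
        selected.append(int(rng.choice(pool)))
    # Cosine distance from every anchor to its closest selected anchor.
    min_dist = np.min(1.0 - anchor_vecs @ anchor_vecs[selected].T, axis=1)
    while len(selected) < view_size:
        min_dist[selected] = -np.inf  # never re-pick a selected anchor
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, 1.0 - anchor_vecs @ anchor_vecs[nxt])
    return selected
```

Each added anchor is the one farthest from the current subset, so the resulting view spreads over the anchor pool rather than concentrating in one cluster.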

View scheduling. As bootstrapping progresses over iterations, the anchor pool $\mathcal{L}_{t-1}$ grows. We increase view diversity by sampling more views from $\mathcal{L}_{t-1}$ while reducing anchors per view. Let $g_t:=\frac{|\mathcal{L}_{t-1}|}{|\mathcal{S}|}$. We define a scaling factor $\mathrm{sf}_t:=1+c\log g_t$ for some $c\geq 0$; we set the number of views at iteration $t$ to $m_t:=\lceil m_0\,\mathrm{sf}_t\rceil$ and the size of each view to $s_t:=\left\lceil\rho_0\,|\mathcal{L}_{t-1}|/\mathrm{sf}_t\right\rceil$, with $\rho_0\in(0,1]$.

Intuitively, this increases the number of views as anchors grow, while making each view smaller so that it preferentially reflects local geometry\. Further, this stabilizes the per\-anchor coverage,*i\.e\.,*each anchor appears in roughly a constant number of views per iteration \(see Appendix[B\.2](https://arxiv.org/html/2605.31100#A2.SS2)\)\.
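As a worked illustration of the schedule, the sketch below evaluates $m_t$ and $s_t$ for a given pool size. The parameter values used in the usage note ($m_0=20$, $\rho_0=0.5$, $c=1$) are hypothetical and chosen only to make the arithmetic concrete; a natural logarithm is assumed.

```python
import math

def view_schedule(pool_size, seed_size, m0, rho0, c):
    """Return (m_t, s_t) for an anchor pool of the given size."""
    g = pool_size / seed_size           # growth of the pool relative to the seeds
    sf = 1.0 + c * math.log(g)          # scaling factor sf_t (natural log assumed)
    m_t = math.ceil(m0 * sf)            # number of views
    s_t = math.ceil(rho0 * pool_size / sf)  # anchors per view
    return m_t, s_t
```

With 15 seeds and these hypothetical settings, the first iteration (pool of 15) gives 20 views of 8 anchors, while a pool of 1500 gives 113 views of 134 anchors. The product $m_t s_t / |\mathcal{L}_{t-1}|$ stays close to $m_0\rho_0 = 10$ in both cases, which matches the roughly constant per-anchor coverage described above.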

Per-view link proposal. Given a view $\mathcal{A}_{t,k}=\{(a_1,a'_1),\dots,(a_{s_t},a'_{s_t})\}$, we construct a view-specific hash for each point and propose links by similarity search in hash space.

Kernelized hashes. Raw distance-to-anchor vectors can be dominated by far anchors where cross-model distances are least consistent. We therefore apply a monotone kernel that downweights large distances. For $u\in\mathbf{E}_1$ (analogously for $v\in\mathbf{E}_2$), we use the kernelized hash $\big(\mathbf{h}_{t,k}(u)\big)_j:=\exp\!\left(-\frac{\mathsf{dist}(u,a_j)}{\sigma_{t,k}}\right)$, for $j=1,\dots,s_t$, where $\mathsf{dist}(\cdot,\cdot)$ is the cosine distance between $\ell_2$-normalized embeddings and $\sigma_{t,k}>0$ is a per-view bandwidth set by the median heuristic: the median of the nonzero pairwise $\ell_2$ distances between hashes within the view. This preserves the rank ordering induced by distances while emphasizing the short-range regime where Theorem [2.2](https://arxiv.org/html/2605.31100#S2.SS2) applies.
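A minimal sketch of the kernelized hash for one side of a view is given below. The function name `kernelized_hash` is hypothetical. As a simplifying assumption, the default bandwidth here is the median of the nonzero point-to-anchor cosine distances, which stands in for the per-view median heuristic described above and is not necessarily the authors' exact rule.

```python
import numpy as np

def kernelized_hash(points, anchors, sigma=None):
    """Map each point to exp(-dist(point, anchor_j) / sigma) over the view's anchors.

    points:  (n, d) L2-normalized embeddings from one encoder.
    anchors: (s, d) L2-normalized anchor vectors from the same encoder.
    """
    dist = np.clip(1.0 - points @ anchors.T, 0.0, 2.0)  # cosine distance
    if sigma is None:
        # Assumed stand-in for the median heuristic: median nonzero distance.
        nonzero = dist[dist > 0]
        sigma = float(np.median(nonzero)) if nonzero.size else 1.0
    return np.exp(-dist / sigma)
```

Every coordinate lies in $(0,1]$ and decreases monotonically with distance, so far anchors contribute values close to zero instead of dominating the hash.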

Mutual nearest neighbor search in hash space. We compare hash vectors by cosine similarity in $\mathbb{R}^{s_t}$. To mitigate hubness in nearest-neighbor search, we use CSLS (Lample et al., [2018](https://arxiv.org/html/2605.31100#bib.bib21)) with parameter $k_{\mathrm{CSLS}}$. Let $\mathrm{csls}_{t,k}(u,v)$ denote the resulting view-specific similarity score. We then propose only mutual nearest neighbors (MNN): a pair $(u,v)$ is proposed if $v$ maximizes $\mathrm{csls}_{t,k}(u,\cdot)$ and $u$ maximizes $\mathrm{csls}_{t,k}(\cdot,v)$. The set of all proposed pairs is denoted $\mathcal{P}_{t,k}$.

At multi-million scale, computing hashes and scoring $\mathsf{csls}_{\mathcal{A}}$ for all points can be costly, so we restrict hash construction and MNN to a local $k_{\mathrm{NN}}$ neighborhood around each view's anchors (Appendix [B.3](https://arxiv.org/html/2605.31100#A2.SS3)). This restriction is consistent with Section [2.1](https://arxiv.org/html/2605.31100#S2.SS1): we save cost on candidate pairs outside a view's local neighborhood, which are less reliable and contribute little useful voting signal.
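The per-view proposal step can be sketched with dense matrices as below. This is an illustrative, brute-force version that assumes both hash matrices fit in memory; the function name `csls_mnn` is hypothetical, and the CSLS score follows the standard definition $2\cos(u,v)-r(u)-r(v)$, where $r(\cdot)$ is the mean cosine similarity to the $k$ nearest neighbors on the other side.

```python
import numpy as np

def csls_mnn(h1, h2, k=10):
    """Propose mutual nearest neighbors under CSLS between two hash matrices.

    h1: (n1, s) hashes of points from the first cloud.
    h2: (n2, s) hashes of points from the second cloud.
    Returns a list of (u, v) index pairs.
    """
    a = h1 / np.linalg.norm(h1, axis=1, keepdims=True)
    b = h2 / np.linalg.norm(h2, axis=1, keepdims=True)
    sim = a @ b.T                                     # cosine similarity, (n1, n2)
    k1, k2 = min(k, sim.shape[1]), min(k, sim.shape[0])
    r1 = np.sort(sim, axis=1)[:, -k1:].mean(axis=1)   # mean top-k similarity per u
    r2 = np.sort(sim, axis=0)[-k2:, :].mean(axis=0)   # mean top-k similarity per v
    csls = 2.0 * sim - r1[:, None] - r2[None, :]
    best_v = csls.argmax(axis=1)                      # best v for each u
    best_u = csls.argmax(axis=0)                      # best u for each v
    return [(u, int(v)) for u, v in enumerate(best_v) if best_u[v] == u]
```

Subtracting the neighborhood terms penalizes "hub" points that are close to many others, and the mutual check keeps only pairs that choose each other.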

Posterior-guided bootstrapping. This step expands $\mathcal{L}_{t-1}$ from tiny seeds ($\mathcal{L}_0=\mathcal{S}$) by promoting only links that are consistently supported across many views. This is exactly the signal/noise separation observed in Section [3](https://arxiv.org/html/2605.31100#S3).

Vote counts. Storing statistics for all $|\mathbf{E}_1|\cdot|\mathbf{E}_2|$ pairs is infeasible. We track only pairs that are proposed at least once. Define the candidate universe up to iteration $t$ by $\mathcal{C}_t:=\bigcup_{r=1}^{t}\bigcup_{k=1}^{m_r}\mathcal{P}_{r,k}$. Each view contributes one binary vote for each $(u,v)\in\mathcal{C}_t$, denoted by $Y_{t,k}(u,v):=\mathbb{I}\big[(u,v)\in\mathcal{P}_{t,k}\big]$. Define the cumulative number of positive votes as $\nu_{(u,v),t}:=\sum_{r=1}^{t}\sum_{k=1}^{m_r}Y_{r,k}(u,v)$, and the total number of views as $N_{\leq t}:=\sum_{r=1}^{t}m_r$.

Beta–Bernoulli posterior as link confidence score. We model the success probability of $(u,v)\in\mathcal{C}$ as $\theta_{(u,v)}$ with an uninformative $\mathrm{Beta}(1,1)$ prior. By Beta–Bernoulli conjugacy,

$$\theta_{(u,v)}\mid\mathcal{Y}_{(u,v),0:t}\sim\mathrm{Beta}\bigl(1+\nu_{(u,v),t},\;1+N_{\leq t}-\nu_{(u,v),t}\bigr),$$

where $\mathcal{Y}_{(u,v),0:t}:=\{Y_{r,k}(u,v)\}_{r=0,\dots,t;\,k=1,\dots,m_r}$. We use the posterior mean $\hat{\theta}_{(u,v),t}=(1+\nu_{(u,v),t})/(2+N_{\leq t})$ as the link confidence score.
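The vote bookkeeping and posterior mean can be sketched in a few lines. The helper names `update_votes` and `posterior_mean` are hypothetical, and the sketch assumes each view's proposals arrive as an iterable of `(u, v)` index pairs; only pairs proposed at least once are stored, as in the text.

```python
from collections import Counter

def update_votes(votes, proposals_per_view):
    """Add one binary vote per view for every pair that view proposes."""
    for proposals in proposals_per_view:
        votes.update(set(proposals))  # a view votes at most once per pair
    return votes

def posterior_mean(votes, total_views):
    """Posterior mean (1 + votes) / (2 + total_views) under a Beta(1, 1) prior."""
    return {pair: (1 + v) / (2 + total_views) for pair, v in votes.items()}
```

For example, a pair proposed in all 3 of 3 views scores $4/5 = 0.8$, while a pair proposed in only one of them scores $2/5 = 0.4$.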

Adaptive link promotion. To identify links without tuning a fixed threshold, we compute an iteration-specific threshold $\tau_t$ using Otsu's rule (Otsu, [1979](https://arxiv.org/html/2605.31100#bib.bib28)) applied to the histogram of $\{\hat{\theta}_{(u,v),t}\}_{(u,v)\in\mathcal{C}_t}$. We promote all pairs with $\hat{\theta}_{(u,v),t}\geq\tau_t$. Since each vector should match at most one target vector, we enforce one-to-one matching by greedily selecting non-conflicting promoted pairs in decreasing $\hat{\theta}_{(u,v),t}$.

As Otsu maximizes between-class variance, $\tau_t$ adapts to the typically bimodal separation between consistently supported links (high posterior) and transient collisions (low posterior), concentrating $\mathcal{L}_t$ on stable, consensus-supported links while filtering out distortion-driven spurious links.
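The promotion step can be sketched as a histogram-based Otsu threshold followed by a greedy one-to-one selection. The function names, the bin count of 64, and the convention of cutting at the right edge of the best bin are assumptions made for this illustration.

```python
import numpy as np

def otsu_threshold(scores, bins=64):
    """Threshold that maximizes between-class variance of a 1-D score histogram."""
    hist, edges = np.histogram(scores, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(hist).astype(float)        # weight of the low class per cut
    w1 = w0[-1] - w0                          # weight of the high class per cut
    cum_mean = np.cumsum(hist * centers)
    mu0 = cum_mean / np.maximum(w0, 1e-12)
    mu1 = (cum_mean[-1] - cum_mean) / np.maximum(w1, 1e-12)
    between = w0 * w1 * (mu0 - mu1) ** 2
    return float(edges[1:][np.argmax(between)])  # cut at the right edge of the best bin

def promote(theta, bins=64):
    """Keep pairs scoring at least tau, greedily enforcing one-to-one matching."""
    tau = otsu_threshold(np.array(list(theta.values())), bins)
    used_u, used_v, links = set(), set(), []
    for (u, v), score in sorted(theta.items(), key=lambda kv: -kv[1]):
        if score >= tau and u not in used_u and v not in used_v:
            links.append((u, v))
            used_u.add(u)
            used_v.add(v)
    return tau, links
```

When the scores are clearly bimodal, the threshold falls in the gap between the two modes; conflicting pairs that share a vector with a higher-scoring pair are dropped.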

Termination & complexity. $\mathsf{GEH}$ stops when $\mathcal{L}_t$ stabilizes, *e.g.,* when no or very few new pairs are promoted for consecutive iterations. Per view, hash construction costs $O((|\mathbf{E}_1|+|\mathbf{E}_2|)\,s_t)$, and nearest-neighbor search is performed in $s_t$ dimensions; total cost scales with the number $\sum_{t=1}^{T}m_t$ of evaluated views. In practice $m_t$ is small (tens) and per-view cost is moderate. (See Appendix [C.2](https://arxiv.org/html/2605.31100#A3.SS2) for details.)

For extremely large-scale cases, the local-neighborhood restriction bounds the hash construction to $O(s_t^2\,k_{\mathrm{NN}})$. The cost of identifying each view's local set is amortized by a one-time $k$-nearest-neighbor index over $\mathbf{E}_1\cup\mathbf{E}_2$ of build cost $O\bigl(|\mathbf{E}_1|+|\mathbf{E}_2|\bigr)$, reused across all $T$ iterations and $\sum_t m_t$ views. (See Appendix [B.3](https://arxiv.org/html/2605.31100#A2.SS3) for more details.)

## 5 Effectiveness

We evaluate the effectiveness of $\mathsf{GEH}$ for vector linking.

### 5.1 Experimental Setup

Benchmarks. We used 6 BEIR (Thakur et al., [2021](https://arxiv.org/html/2605.31100#bib.bib35)) text retrieval benchmarks: NFCorpus, SciFact, ArguAna, SciDocs, FiQA, and FEVER (see Table [5](https://arxiv.org/html/2605.31100#A3.T5) in Appendix [C.1](https://arxiv.org/html/2605.31100#A3.SS1)). Each benchmark provides a corpus $\mathcal{D}$ of documents and built-in query-answer pairs for retrieval performance evaluation.

Vector linking setup. Given a corpus $\mathcal{D}$, we constructed two partially overlapping corpora $\mathcal{D}_1,\mathcal{D}_2$ as follows. We sampled an overlap set $\Omega\subset\mathcal{D}$, split the residual $\mathcal{D}\setminus\Omega$ uniformly at random into two disjoint sets $\mathcal{U}_1,\mathcal{U}_2$, and set $\mathcal{D}_1:=\Omega\cup\mathcal{U}_1$ and $\mathcal{D}_2:=\Omega\cup\mathcal{U}_2$. We controlled the overlap level via $\alpha:=\frac{|\mathcal{D}_1\cap\mathcal{D}_2|}{|\mathcal{D}_1\cup\mathcal{D}_2|}=\frac{|\Omega|}{|\Omega|+|\mathcal{U}_1|+|\mathcal{U}_2|}$. We embedded $\mathcal{D}_1$ and $\mathcal{D}_2$ with encoders $f_1$ and $f_2$, respectively, to obtain the two embedding clouds: $\mathbf{E}_1:=f_1(\mathcal{D}_1)$ and $\mathbf{E}_2:=f_2(\mathcal{D}_2)$. We set the ground-truth correspondence set $M^{*}:=\{(f_1(x),f_2(x)):x\in\Omega\}$. All methods were given only $\mathbf{E}_1,\mathbf{E}_2$; they were not told $\Omega$ and had no access to $f_i$ or $\mathcal{D}_i$.
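The corpus split can be sketched as below. The function name `split_with_overlap` is hypothetical, and splitting the residual into two equal-size halves is an assumption of this sketch; the text only states that the residual is split uniformly at random.

```python
import numpy as np

def split_with_overlap(doc_ids, alpha, rng=None):
    """Build two corpora D1, D2 whose overlap ratio |D1 ∩ D2| / |D1 ∪ D2| is alpha."""
    rng = np.random.default_rng() if rng is None else rng
    ids = rng.permutation(np.asarray(doc_ids))
    n_overlap = int(round(alpha * len(ids)))   # |Omega| = alpha * |D|
    omega, rest = ids[:n_overlap], ids[n_overlap:]
    half = len(rest) // 2                      # assumed: equal-size residual halves
    u1, u2 = rest[:half], rest[half:]
    d1 = np.concatenate([omega, u1])
    d2 = np.concatenate([omega, u2])
    return omega, d1, d2
```

Because $\mathcal{D}_1\cup\mathcal{D}_2=\mathcal{D}$ in this construction, the overlap ratio reduces to $|\Omega|/|\mathcal{D}|$, so a 1000-document corpus with $\alpha=0.3$ yields 300 shared documents.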

By default, we drew a seed set $\mathcal{S}\subset M^{*}$ by uniformly sampling overlap items. We report results for three seed sizes $|\mathcal{S}|\in\{15,20,30\}$. We also evaluated out-of-domain (OOD) seeds, which are drawn from a different dataset (Section [5.4](https://arxiv.org/html/2605.31100#S5.SS4)). In each case, all methods received the same $\mathcal{S}$.

Models. We used 5 pairs of major embedding models: (a) Mistral (Mistral-embed (Jiang et al., [2023](https://arxiv.org/html/2605.31100#bib.bib18))) vs. OpenAI (Text-embedding-3-small ([OpenAI,](https://arxiv.org/html/2605.31100#bib.bib27))), (b) GTE (GTE-Qwen2-7B-instruct (Li et al., [2023](https://arxiv.org/html/2605.31100#bib.bib23))) vs. Mistral, (c) GTE vs. OpenAI, (d) Qwen (Qwen3-Embedding-8B (Zhang et al., [2025](https://arxiv.org/html/2605.31100#bib.bib47))) vs. KaLM (KaLM-Embedding-Gemma3-12B (Zhao et al., [2025](https://arxiv.org/html/2605.31100#bib.bib48))), and (e) Qwen vs. OpenAI.

Table 1: Vector linking at overlap $\alpha=0.3$, seeds $|\mathcal{S}|=15$: each cell shows precision/recall/F1 (%); bold indicates best per column.

Baselines. As there is no prior method that directly tackles vector linking with partial overlap, we adapt embedding alignment methods by incorporating ideas from $\mathsf{GEH}$: they first align $\mathbf{E}_1$ to $\mathbf{E}_2$ by supervising on $\mathcal{S}$, yielding a shared embedding space with aligned $\mathbf{E}_1$ and $\mathbf{E}_2$; they then use CSLS-based MNN search to identify links, as $\mathsf{GEH}$ does for link proposal. Specifically, we trained 6 alignment methods:

- $\mathsf{Linear}$: regression with MSE (Mikolov et al., [2013](https://arxiv.org/html/2605.31100#bib.bib26)).
- $\mathsf{CCA}$: canonical correlation analysis (Lu et al., [2015](https://arxiv.org/html/2605.31100#bib.bib24)).
- $\mathsf{MLP}$: two-layer MLP trained with cosine loss on seeds.
- $\mathsf{RCSLS}$: RCSLS (Joulin et al., [2018](https://arxiv.org/html/2605.31100#bib.bib19)), a retrieval-based linear mapping optimized via SGD.
- $\mathsf{Proc}$: orthogonal Procrustes alignment on seeds (closed-form SVD) (Smith et al., [2017](https://arxiv.org/html/2605.31100#bib.bib34)).
- $\mathsf{UGW}$: unbalanced Gromov-Wasserstein (Séjourné et al., [2021](https://arxiv.org/html/2605.31100#bib.bib32)), quadratic OT on intra-space distance matrices, warm-started from a seed-biased coupling.
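To make the alignment-style baselines concrete, the closed-form orthogonal Procrustes solution can be sketched as follows. This is a generic textbook version, not the paper's baseline code: the function name `procrustes_align` is hypothetical, and the sketch assumes both seed matrices have the same dimensionality, which in practice would require padding or projecting one of the embedding spaces.

```python
import numpy as np

def procrustes_align(src, tgt):
    """Orthogonal W minimizing ||src @ W - tgt||_F, via SVD of src^T tgt.

    src, tgt: (n, d) matrices of paired seed embeddings (equal d assumed).
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt
```

After fitting `W` on the seed pairs, all of $\mathbf{E}_1$ would be mapped with `E1 @ W` and links extracted by nearest-neighbor search in the shared space.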

We also tested $\mathsf{AO}$ (Anchor Optimization (Cannistraci et al., [2023](https://arxiv.org/html/2605.31100#bib.bib6))), which was given the overlap size to adapt to the task.

Metrics. We measured the output of each method, *i.e.,* the set of predicted links $M$, by (a) $\mathrm{Precision}:=\frac{|M\cap M^{*}|}{|M|}$, (b) $\mathrm{Recall}:=\frac{|M\cap M^{*}|}{|M^{*}|}$, and (c) $\mathrm{F1}:=\frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$. In tables, we report all the metrics, and bold indicates the best.
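These set-based metrics translate directly into code. The sketch below assumes links are represented as hashable `(u, v)` pairs; the function name `link_metrics` and the convention of returning 0 for empty sets are choices made for this example.

```python
def link_metrics(predicted, truth):
    """Precision, recall, and F1 of predicted links against ground-truth links."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)                      # correctly predicted links
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```

For instance, predicting 4 links of which 3 are among 5 true links gives precision 0.75, recall 0.6, and F1 of about 0.667.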

Further details can be found in Appendix[C\.1](https://arxiv.org/html/2605.31100#A3.SS1)\.

### 5.2 Performance on Vector Linking

Overall. We first evaluated the performance of all methods for linking all 5 pairs of embedding models across all datasets except FEVER (reserved for the scalability test). Table [1](https://arxiv.org/html/2605.31100#S5.T1) summarizes their recall, precision, and F1 with only 15 seed anchors in $\mathcal{S}$ for an overlap ratio $\alpha=30\%$ (see Tables [18](https://arxiv.org/html/2605.31100#A3.T18)-[26](https://arxiv.org/html/2605.31100#A3.T26) in Appendix [C.3.1](https://arxiv.org/html/2605.31100#A3.SS3.SSS1) for a complete report).

Table 2: Scalability on FEVER on Mistral $\leftrightarrow$ OpenAI, single A100 80 GB GPU, overlap $\alpha=0.3$, $|\mathcal{S}|=30$. Bold marks the best per column; runtime is end-to-end wall-clock seconds. ($\mathsf{UGW}$ cannot complete on FEVER so it is not reported.)

The results are very encouraging: our method $\mathsf{GEH}$ consistently outperforms all other methods across all cases by a substantial margin. For example, on FiQA, we need only 15 seeds to link the Qwen and KaLM models with 79.9% recall, 79.8% precision, and 79.8% F1-score, while the second-best method achieves only 1.3%, 20.6%, and 2.4%, respectively. The results for linking across other model pairs and datasets are similar.

Table 3: Vector linking on SciFact (Mistral $\leftrightarrow$ OpenAI): each cell reports precision/recall/F1 (%). Best values per metric are bolded.

Varying overlap and seeds. We further evaluated the impact of the overlap ratio $\alpha$ and the number of seed anchors $|\mathcal{S}|$. The results on the SciFact dataset for linking the Mistral and OpenAI models are shown in Table [3](https://arxiv.org/html/2605.31100#S5.T3) (see Appendix [C.3.1](https://arxiv.org/html/2605.31100#A3.SS3.SSS1) for more).

Our method, $\mathsf{GEH}$, is particularly robust to the number of seed anchors; for instance, with 15 seeds, it already performed as well as it did with 30 seeds, while all other methods required more seeds to improve performance. All methods performed better with larger overlap, but the large gap between $\mathsf{GEH}$ and the others remained stable and significant.

Scalability. We also evaluated the scalability of $\mathsf{GEH}$ on FEVER (a corpus of 5.4 million documents) using Mistral $\leftrightarrow$ OpenAI embeddings, fixing the overlap ratio to $\alpha=0.3$ and the seed budget to $|\mathcal{S}|=30$. All methods were run on a single NVIDIA A100 (80 GB). For $\mathsf{GEH}$, the per-view local-set restriction described in Section [4](https://arxiv.org/html/2605.31100#S4) is activated due to the size of FEVER.

Table [2](https://arxiv.org/html/2605.31100#S5.T2) reports the precision, recall, and end-to-end wall-clock runtime on a single A100. $\mathsf{GEH}$ attains $93.8\%$ precision and $68.9\%$ recall in $3328$ s. $\mathsf{GEH}$ remains within the same order of magnitude as the fastest baseline ($\mathsf{CCA}$, $\sim 2.1\times$ slower) and is faster than all other baselines despite its iterative design. The gap is large: relative to $\mathsf{CCA}$, $\mathsf{GEH}$ improves precision from $5.22\%$ to $93.8\%$ and recall from $0.61\%$ to $68.9\%$, at only $\sim 2.1\times$ the wall-clock cost.

Note that CSLS+MNN link extraction is shared by all alignment baselines, so the incremental cost of $\mathsf{GEH}$ comes only from evaluating multiple views. Crucially, each view operates in the low-dimensional distance-to-anchor hash space induced by a small anchor set, rather than re-matching in the original embedding space. The iterative procedure thus performs many moderate-cost hash-space retrievals rather than repeated dense searches over full embedding clouds.

Table 4: Ablation of $\mathsf{GEH}$ components on SciDocs with Mistral vs. OpenAI, at $\alpha=0.15$ and seeds $|\mathcal{S}|=15$: $\mathsf{GEH}$ denotes the complete pipeline: FPS view sampling, kernelized signature, adaptive view scheduling, multi-view posterior aggregation, and bootstrapping. Each subsequent row removes one component relative to $\mathsf{GEH}$; the row $-$FPS & Kernel removes both per-view components jointly. Each cell reports the seed-level mean $\pm$ standard deviation (%) of the corresponding metric. A $\dagger$ marks cells where the seed-level coefficient of variation ($\sigma/\mu$) exceeds $10\%$.

### 5.3 Ablation Study

Staged ablation. We ablate the five components of $\mathsf{GEH}$ on SciDocs (Mistral vs. OpenAI) with $\alpha=0.15$ and $|\mathcal{S}|=15$. For each variant we run $10$ independent trials, each with a fresh random seed controlling both the draw of $\mathcal{S}$ and the internal randomness of view generation, and report the across-trial mean and standard deviation ($\mu\pm\sigma$).

Each variant is $\mathsf{GEH}$ but with the named component swapped for a simpler default. The replacement defaults are:

- ∙\\bullet−\-Kernel: use the raw distance vectorr𝒜r\_\{\\mathcal\{A\}\}instead of the kernelized hashh𝒜h\_\{\\mathcal\{A\}\}\(Section[4](https://arxiv.org/html/2605.31100#S4)\)\.
- ∙\\bullet−\-FPS: draw view anchors uniformly at random fromℒt−1\\mathcal\{L\}\_\{t\-1\}instead of by FPS \(Section[4](https://arxiv.org/html/2605.31100#S4)\)\.
- ∙\\bullet−\-FPS & Kernel: both per\-view defaults applied jointly \(raw hashes plus random view draws\)\.
- ∙\\bullet−\-View scheduling: freeze the schedule at iteration zero,ρt≡ρ0\\rho\_\{t\}\\equiv\\rho\_\{0\}andmt≡m0m\_\{t\}\\equiv m\_\{0\}\(Section[4](https://arxiv.org/html/2605.31100#S4)\)\.
- ∙\\bullet−\-Multi\-view voting: use single\-view MNN proposal,mt=1m\_\{t\}\{=\}1and𝒜t,1=ℒt−1\\mathcal\{A\}\_\{t,1\}\{=\}\\mathcal\{L\}\_\{t\-1\}\(Section[3](https://arxiv.org/html/2605.31100#S3)\)\.
- ∙\\bullet−\-Bootstrapping: run a single iteration on the𝒮\\mathcal\{S\}; no anchor\-pool growth \(Section[4](https://arxiv.org/html/2605.31100#S4)\)\.

Table [4](https://arxiv.org/html/2605.31100#S5.T4) reports the results. Bootstrapping and multi-view voting account for most of the absolute performance. Without bootstrapping, only the 15 seeds were available as anchors and $\mathsf{GEH}$ collapsed to $1.9/2.1/1.9$ for Precision (%) / Recall (%) / F1 (%); without multi-view voting, single-view link proposals could not separate true links from distortion-driven collisions and performance fell to $24.0/39.4/29.8$.

Removing FPS or the kernel signature makes performance highly variable (*e.g.,* recall $\sigma$ rose from $0.7$ to $19$–$35$): these two per-view components act as variance reducers, FPS by spreading anchors so each view stays well-conditioned and the kernel by suppressing the long-range distance regime that decorrelates across encoders (Section [2.1](https://arxiv.org/html/2605.31100#S2.SS1)).

![Refer to caption](https://arxiv.org/html/2605.31100v1/x4.png)

Figure 4: Impact of view construction and distance encoding: on SciDocs with Mistral vs. OpenAI, we compared the precision (left) and recall (right) of view strategies (FPS, Random), each with Kernelized or Raw distances. Shaded areas show variance ($\pm 1$ std).

View generation and hash encoding. We further examined the view generation strategy and hash encoding in $\mathsf{GEH}$. With 15 fixed seeds, we varied the overlap ratio from 0.15 to 0.45 and compared two view generation strategies, namely, *Random* (uniformly sampled) and *FPS* (furthest-point-sampled), each combined with two distance encodings: *Raw* (unprocessed distance $\mathsf{dist}$) and *Kernelized* ($\exp(-\mathsf{dist}/\sigma)$).

Fig. [4](https://arxiv.org/html/2605.31100#S5.F4) reports the precision and recall of vector linking on SciDocs with Mistral and OpenAI. Kernelized encoding consistently dominates the Raw variant for each view strategy, especially in recall, confirming that emphasizing short-range distances improves robustness. For view construction, FPS always gives the better performance, particularly in recall, indicating that dispersed geometric views are preferable in the presence of cross-model distance distortion due to their better stability (see analysis in Appendix [B.1](https://arxiv.org/html/2605.31100#A2.SS1)).

### 5.4 Robustness to Domain Shifts (OOD Analysis)

In practice, exact vectors stored in a private index may be inaccessible; users can encode a small public corpus with both models and use those embeddings as references\. To simulate this, we replace in\-domain anchors with*out\-of\-domain \(OOD\)*anchors drawn from a different dataset\.

For each target dataset, we construct Mistral and OpenAI embeddings, enforce a 30% overlap, and draw 30 OOD seeds from a separate reference dataset. Fig. [5](https://arxiv.org/html/2605.31100#S5.F5) reports the precision and recall across all reference-target dataset pairs: precision remains in the 77-87% range and recall is mostly between 80% and 97%. Additional results across seed budgets and overlap ratios are given in Appendix [C.3.2](https://arxiv.org/html/2605.31100#A3.SS3.SSS2). This indicates that small OOD corpora are sufficient to serve as anchors for geometric hashing to link private embedding clouds.

![Refer to caption](https://arxiv.org/html/2605.31100v1/x5.png)

Figure 5:Out\-of\-domain anchor transfer:Precision \(left\) and recall \(right\) of our method on five target datasets \(columns\) when supervised seeds are drawn from an out\-of\-domain reference dataset \(rows\)\. All runs use 30% overlap in the target and 30 OOD seeds\.

## 6 Applications

Finally, we demonstrate applications of vector linking\.

Vector database integration\. We demonstrate its benefit for*vector database integration*, enabling unified retrieval across vector databases embedded by distinct encoders\.

Setup. We used the integration protocol of (Yang et al., [2025](https://arxiv.org/html/2605.31100#bib.bib44)), which learns clustered Procrustes over known paired anchors to transform one vector database and merge it with the target database for unified querying. Instead of assuming paired anchors are given, we used $\mathsf{GEH}$ to infer links across databases and then applied the integration protocol. Following (Yang et al., [2025](https://arxiv.org/html/2605.31100#bib.bib44)), we evaluated the integrated database via the recall@100 and NDCG@100 of benchmark queries.

Baselines. We compared against (i) *Random*, random anchor pairing; (ii) *Seed*, mapping learned from seed anchor pairs only; and (iii) *Union*, retrieval without cross-space mapping (directly taking the union of the two databases). As an upper-bound reference, we also re-embedded the full union of the two corpora with a single model; its retrieval performance serves as the theoretical upper limit.

Results. Using the split in Section [5](https://arxiv.org/html/2605.31100#S5) with answer-free overlap (Appendix [D.1.1](https://arxiv.org/html/2605.31100#A4.SS1.SSS1)), we evaluate the query performance of integrating Mistral and OpenAI databases, with overlap ratio $\alpha$ varying from 5% to 40% and 30 seeds. Figure [6](https://arxiv.org/html/2605.31100#S6.F6) reports results on SciFact. Our method substantially outperforms all baselines on both Recall@100 and NDCG@100, with performance improving as overlap increases and approaching the theoretical limit of using Mistral or OpenAI alone.

![Refer to caption](https://arxiv.org/html/2605.31100v1/x6.png)

Figure 6: Integrated vector database retrieval performance: over SciFact, Recall@100 (left) and NDCG@100 (right) vs. overlap ratio $\alpha$, where the overlap contains *no benchmark answers*. Mistral and OpenAI are the theoretical upper limit of retrieval quality where we embed all objects with one single model.

Cross-model clustering. We also demonstrate cross-model clustering. Using vector linking, we stitched the two embedding sets and ran clustering to recover clusters spanning both datasets. Our method recovers consistent cross-embedding cluster assignments for $75$–$98\%$ of overlapping objects, and achieves cluster quality within $\approx 1\%$ of the clusters obtained when all objects are embedded by a single encoder (see Appendix [D](https://arxiv.org/html/2605.31100#A4) for details).

## 7 Conclusion

We introduced vector linking: recovering cross-model correspondences from two black-box embedding clouds under partial, unknown overlap. Our core observation is that independently trained contrastive encoders exhibit *local* cross-model geometric consistency. This motivates encoder-invariant geometric hashing based on distance-to-anchor signatures, which we instantiate with a multi-view iterative algorithm that emphasizes short-range distances and bootstraps a large anchor pool from a tiny seed set. Experiments across multiple benchmarks and model pairs show robust, high-accuracy linking that enables downstream tasks such as vector database integration and cross-model clustering.

Future work includes reducing seed assumptions, extending to multi\-model linking, and studying when local consistency holds beyond the current contrastive surrogate\.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## Acknowledgements

We thank the anonymous ICML reviewers and area chair for their constructive feedback\. This work is supported by RAEng RF\\201920\\19\\319 and the Huawei\-Edinburgh Joint Lab\.

## References

- Alvarez\-Melis & Jaakkola \(2018\)Alvarez\-Melis, D\. and Jaakkola, T\. S\.Gromov\-wasserstein alignment of word embedding spaces\.In*EMNLP*, pp\. 1881–1890, 2018\.
- Artetxe et al\. \(2018\)Artetxe, M\., Labaka, G\., and Agirre, E\.A robust self\-learning method for fully unsupervised cross\-lingual mappings of word embeddings\.In*Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*\. Association for Computational Linguistics, 2018\.doi:10\.18653/v1/p18\-1073\.URL[http://dx\.doi\.org/10\.18653/v1/P18\-1073](http://dx.doi.org/10.18653/v1/P18-1073)\.
- Besl & McKay \(1992\)Besl, P\. and McKay, N\. D\.A method for registration of 3\-d shapes\.*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 14\(2\):239–256, 1992\.doi:10\.1109/34\.121791\.
- Bojanowski et al\. \(2016\)Bojanowski, P\., Grave, E\., Joulin, A\., and Mikolov, T\.Enriching word vectors with subword information\.*arXiv preprint arXiv:1607\.04606*, 2016\.
- Boteva et al\. \(2016\)Boteva, V\., Gholipour, D\., Sokolov, A\., and Riezler, S\.A full\-text learning to rank dataset for medical information retrieval\.In*Advances in Information Retrieval: 38th European Conference on IR Research, ECIR 2016, Padua, Italy, March 20–23, 2016\. Proceedings 38*, pp\. 716–722\. Springer, 2016\.
- Cannistraci et al\. \(2023\)Cannistraci, I\., Moschella, L\., Maiorca, V\., Fumero, M\., Norelli, A\., and Rodolà, E\.Bootstrapping parallel anchors for relative representations, 2023\.URL[https://arxiv\.org/abs/2303\.00721](https://arxiv.org/abs/2303.00721)\.
- Cohan et al\. \(2020\)Cohan, A\., Feldman, S\., Beltagy, I\., Downey, D\., and Weld, D\. S\.Specter: Document\-level representation learning using citation\-informed transformers\.*arXiv preprint arXiv:2004\.07180*, 2020\.
- Cover & Thomas \(2006\)Cover, T\. M\. and Thomas, J\. A\.*Elements of Information Theory*\.Wiley, Hoboken, NJ, 2 edition, 2006\.
- Dao et al\. \(2019\)Dao, T\., Gu, A\., Ratner, A\., Smith, V\., Sa, C\. D\., and Ré, C\.A kernel theory of modern data augmentation\.In*ICML*, 2019\.
- Enevoldsen et al\. \(2025\)Enevoldsen, K\., Chung, I\., Kerboua, I\., Kardos, M\., Mathur, A\., Stap, D\., Gala, J\., Siblini, W\., Krzemiński, D\., Winata, G\. I\., Sturua, S\., Utpala, S\., Ciancone, M\., Schaeffer, M\., Sequeira, G\., Misra, D\., Dhakal, S\., Rystrøm, J\., Solomatin, R\., Ömer Çağatan, Kundu, A\., Bernstorff, M\., Xiao, S\., Sukhlecha, A\., Pahwa, B\., Poświata, R\., GV, K\. K\., Ashraf, S\., Auras, D\., Plüster, B\., Harries, J\. P\., Magne, L\., Mohr, I\., Hendriksen, M\., Zhu, D\., Gisserot\-Boukhlef, H\., Aarsen, T\., Kostkan, J\., Wojtasik, K\., Lee, T\., Šuppa, M\., Zhang, C\., Rocca, R\., Hamdy, M\., Michail, A\., Yang, J\., Faysse, M\., Vatolin, A\., Thakur, N\., Dey, M\., Vasani, D\., Chitale, P\., Tedeschi, S\., Tai, N\., Snegirev, A\., Günther, M\., Xia, M\., Shi, W\., Lù, X\. H\., Clive, J\., Krishnakumar, G\., Maksimova, A\., Wehrli, S\., Tikhonova, M\., Panchal, H\., Abramov, A\., Ostendorff, M\., Liu, Z\., Clematide, S\., Miranda, L\. J\., Fenogenova, A\., Song, G\., Safi, R\. B\., Li, W\.\-D\., Borghini, A\., Cassano, F\., Su, H\., Lin, J\., Yen, H\., Hansen, L\., Hooker, S\., Xiao, C\., Adlakha, V\., Weller, O\., Reddy, S\., and Muennighoff, N\.Mmteb: Massive multilingual text embedding benchmark\.*arXiv preprint arXiv:2502\.13595*, 2025\.doi:10\.48550/arXiv\.2502\.13595\.URL[https://arxiv\.org/abs/2502\.13595](https://arxiv.org/abs/2502.13595)\.
- Fischler & Bolles \(1981\)Fischler, M\. A\. and Bolles, R\. C\.Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography\.*Commun\. ACM*, 24\(6\):381–395, June 1981\.ISSN 0001\-0782\.doi:10\.1145/358669\.358692\.URL[https://doi\.org/10\.1145/358669\.358692](https://doi.org/10.1145/358669.358692)\.
- Ganin et al\. \(2016\)Ganin, Y\., Ustinova, E\., Ajakan, H\., Germain, P\., Larochelle, H\., Laviolette, F\., Marchand, M\., and Lempitsky, V\.Domain\-adversarial training of neural networks, 2016\.URL[https://arxiv\.org/abs/1505\.07818](https://arxiv.org/abs/1505.07818)\.
- Geigle et al\. \(2021\)Geigle, G\., Reimers, N\., Rücklé, A\., and Gurevych, I\.Tweac: Transformer with extendable qa agent classifiers, 2021\.URL[https://arxiv\.org/abs/2104\.07081](https://arxiv.org/abs/2104.07081)\.
- Gonzalez \(1985\)Gonzalez, T\. F\.Clustering to minimize the maximum intercluster distance\.*Theoretical Computer Science*, 38:293–306, 1985\.ISSN 0304\-3975\.doi:https://doi\.org/10\.1016/0304\-3975\(85\)90224\-5\.URL[https://www\.sciencedirect\.com/science/article/pii/0304397585902245](https://www.sciencedirect.com/science/article/pii/0304397585902245)\.
- Grave et al\. \(2019\)Grave, E\., Joulin, A\., and Berthet, Q\.Unsupervised alignment of embeddings with wasserstein procrustes\.In Chaudhuri, K\. and Sugiyama, M\. \(eds\.\),*Proceedings of the Twenty\-Second International Conference on Artificial Intelligence and Statistics*, volume 89 of*Proceedings of Machine Learning Research*, pp\. 1880–1890\. PMLR, 16–18 Apr 2019\.URL[https://proceedings\.mlr\.press/v89/grave19a\.html](https://proceedings.mlr.press/v89/grave19a.html)\.
- Hoffman et al\. \(2017\)Hoffman, J\., Tzeng, E\., Park, T\., Zhu, J\.\-Y\., Isola, P\., Saenko, K\., Efros, A\. A\., and Darrell, T\.Cycada: Cycle\-consistent adversarial domain adaptation, 2017\.URL[https://arxiv\.org/abs/1711\.03213](https://arxiv.org/abs/1711.03213)\.
- Hu et al\. \(2022\)Hu, W\., Bansal, R\., Cao, K\., Rao, N\., Subbian, K\., and Leskovec, J\.Learning backward compatible embeddings\.In*KDD*, 2022\.
- Jiang et al\. \(2023\)Jiang, A\. Q\., Sablayrolles, A\., Mensch, A\., Bamford, C\., Chaplot, D\. S\., de las Casas, D\., Bressand, F\., Lengyel, G\., Lample, G\., Saulnier, L\., Lavaud, L\. R\., Lachaux, M\.\-A\., Stock, P\., Le Scao, T\., Lavril, T\., Wang, T\., Lacroix, T\., and El Sayed, W\.Mistral 7b\.*arXiv preprint arXiv:2310\.06825*, 2023\.doi:10\.48550/arXiv\.2310\.06825\.URL[https://arxiv\.org/abs/2310\.06825](https://arxiv.org/abs/2310.06825)\.
- Joulin et al\. \(2018\)Joulin, A\., Bojanowski, P\., Mikolov, T\., Jégou, H\., and Grave, E\.Loss in translation: Learning bilingual word mapping with a retrieval criterion\.In Riloff, E\., Chiang, D\., Hockenmaier, J\., and Tsujii, J\. \(eds\.\),*Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 \- November 4, 2018*, pp\. 2979–2984\. Association for Computational Linguistics, 2018\.doi:10\.18653/V1/D18\-1330\.
- Lamdan & Wolfson \(1988\)Lamdan, Y\. and Wolfson, H\.Geometric hashing: A general and efficient model\-based recognition scheme\.In*\[1988 Proceedings\] Second International Conference on Computer Vision*, pp\. 238–249, 1988\.doi:10\.1109/CCV\.1988\.589995\.
- Lample et al\. \(2018\)Lample, G\., Conneau, A\., Ranzato, M\., Denoyer, L\., and Jégou, H\.Word translation without parallel data\.In*ICLR*, 2018\.
- Lee \(2003\)Lee, J\. M\.*Smooth manifolds*\.Springer, 2003\.
- Li et al\. \(2023\)Li, Z\., Zhang, X\., Zhang, Y\., Long, D\., Xie, P\., and Zhang, M\.Towards general text embeddings with multi\-stage contrastive learning\.*arXiv preprint arXiv:2308\.03281*, 2023\.
- Lu et al\. \(2015\)Lu, A\., Wang, W\., Bansal, M\., Gimpel, K\., and Livescu, K\.Deep multilingual correlation for improved word embeddings\.In Mihalcea, R\., Chai, J\., and Sarkar, A\. \(eds\.\),*Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*, pp\. 250–256, Denver, Colorado, May–June 2015\. Association for Computational Linguistics\.doi:10\.3115/v1/N15\-1028\.URL[https://aclanthology\.org/N15\-1028/](https://aclanthology.org/N15-1028/)\.
- Maia et al\. \(2018\)Maia, M\., Handschuh, S\., Freitas, A\., Davis, B\., McDermott, R\., Zarrouk, M\., and Balahur, A\.Www’18 open challenge: financial opinion mining and question answering\.In*Companion proceedings of the the web conference 2018*, pp\. 1941–1942, 2018\.
- Mikolov et al\. \(2013\)Mikolov, T\., Le, Q\. V\., and Sutskever, I\.Exploiting similarities among languages for machine translation\.*CoRR*, abs/1309\.4168, 2013\.
- \(27\)OpenAI\.text\-embedding\-3\-small \(model documentation\)\.[https://platform\.openai\.com/docs/models/text\-embedding\-3\-small](https://platform.openai.com/docs/models/text-embedding-3-small)\.Accessed: 2026\-01\-26\.
- Otsu \(1979\)Otsu, N\.A threshold selection method from gray\-level histograms\.*IEEE Transactions on Systems, Man, and Cybernetics*, 9\(1\):62–66, 1979\.doi:10\.1109/TSMC\.1979\.4310076\.
- Pennington et al\. \(2014\)Pennington, J\., Socher, R\., and Manning, C\.GloVe: Global vectors for word representation\.In Moschitti, A\., Pang, B\., and Daelemans, W\. \(eds\.\),*Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, pp\. 1532–1543, Doha, Qatar, October 2014\. Association for Computational Linguistics\.doi:10\.3115/v1/D14\-1162\.URL[https://aclanthology\.org/D14\-1162/](https://aclanthology.org/D14-1162/)\.
- Petersen et al\. \(2008\)Petersen, K\. B\., Pedersen, M\. S\., et al\.The matrix cookbook\.*Technical University of Denmark*, 7\(15\):510, 2008\.
- Poole et al\. \(2019\)Poole, B\., Ozair, S\., van den Oord, A\., Alemi, A\. A\., and Tucker, G\.On variational bounds of mutual information\.In*ICML*, 2019\.
- Séjourné et al\. \(2021\)Séjourné, T\., Vialard, F\.\-X\., and Peyré, G\.The unbalanced gromov wasserstein distance: conic formulation and relaxation\.In*Proceedings of the 35th International Conference on Neural Information Processing Systems*, NIPS ’21, Red Hook, NY, USA, 2021\. Curran Associates Inc\.ISBN 9781713845393\.
- Shen et al\. \(2021\)Shen, Y\., Xiong, Y\., Xia, W\., and Soatto, S\.Towards backward\-compatible representation learning, 2021\.URL[https://arxiv\.org/abs/2003\.11942](https://arxiv.org/abs/2003.11942)\.
- Smith et al\. \(2017\)Smith, S\. L\., Turban, D\. H\. P\., Hamblin, S\., and Hammerla, N\. Y\.Offline bilingual word vectors, orthogonal transformations and the inverted softmax\.In*5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24\-26, 2017, Conference Track Proceedings*\. OpenReview\.net, 2017\.
- Thakur et al\. \(2021\)Thakur, N\., Reimers, N\., Rücklé, A\., Srivastava, A\., and Gurevych, I\.Beir: A heterogenous benchmark for zero\-shot evaluation of information retrieval models\.*arXiv preprint arXiv:2104\.08663*, 2021\.
- Thorne et al\. \(2018\)Thorne, J\., Vlachos, A\., Christodoulopoulos, C\., and Mittal, A\.Fever: a large\-scale dataset for fact extraction and verification, 2018\.URL[https://arxiv\.org/abs/1803\.05355](https://arxiv.org/abs/1803.05355)\.
- van den Oord et al\. \(2018\)van den Oord, A\., Li, Y\., and Vinyals, O\.Representation learning with contrastive predictive coding\.*arXiv:1807\.03748*, 2018\.
- Wachsmuth et al\. \(2018\)Wachsmuth, H\., Syed, S\., and Stein, B\.Retrieval of the best counterargument without prior topic knowledge\.In*Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pp\. 241–251, 2018\.
- Wadden et al\. \(2020\)Wadden, D\., Lin, S\., Lo, K\., Wang, L\. L\., van Zuylen, M\., Cohan, A\., and Hajishirzi, H\.Fact or fiction: Verifying scientific claims\.In*Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, pp\. 7534–7550, Online, November 2020\. Association for Computational Linguistics\.doi:10\.18653/v1/2020\.emnlp\-main\.609\.URL[https://aclanthology\.org/2020\.emnlp\-main\.609/](https://aclanthology.org/2020.emnlp-main.609/)\.
- Wang & Mahadevan \(2011\)Wang, C\. and Mahadevan, S\.Heterogeneous domain adaptation using manifold alignment\.In*Proceedings of the Twenty\-Second International Joint Conference on Artificial Intelligence \- Volume Volume Two*, IJCAI’11, pp\. 1541–1546\. AAAI Press, 2011\.ISBN 9781577355144\.
- Wang et al\. \(2018\)Wang, J\., Feng, W\., Chen, Y\., Yu, H\., Huang, M\., and Yu, P\. S\.Visual domain adaptation with manifold embedded distribution alignment, 2018\.URL[https://arxiv\.org/abs/1807\.07258](https://arxiv.org/abs/1807.07258)\.
- Wang & Isola \(2020\)Wang, T\. and Isola, P\.Understanding contrastive representation learning through alignment and uniformity on the hypersphere\.In*ICML*, 2020\.
- Xing et al\. \(2015\)Xing, C\., Wang, D\., Liu, C\., and Lin, Y\.Normalized word embedding and orthogonal transform for bilingual word translation\.In Mihalcea, R\., Chai, J\., and Sarkar, A\. \(eds\.\),*Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*, pp\. 1006–1011, Denver, Colorado, May–June 2015\. Association for Computational Linguistics\.doi:10\.3115/v1/N15\-1104\.URL[https://aclanthology\.org/N15\-1104/](https://aclanthology.org/N15-1104/)\.
- Yang et al\. \(2025\)Yang, B\., Cao, Y\., and Ren, Y\.Integrating vector databases across embedding models\.In*SIGMOD*, 2025\.
- Yang et al\. \(2020\)Yang, H\., Shi, J\., and Carlone, L\.Teaser: Fast and certifiable point cloud registration, 2020\.URL[https://arxiv\.org/abs/2001\.07715](https://arxiv.org/abs/2001.07715)\.
- Yang et al\. \(2016\)Yang, J\., Li, H\., Campbell, D\., and Jia, Y\.Go\-icp: A globally optimal solution to 3d icp point\-set registration\.*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 38\(11\):2241–2254, November 2016\.ISSN 2160\-9292\.doi:10\.1109/tpami\.2015\.2513405\.URL[http://dx\.doi\.org/10\.1109/TPAMI\.2015\.2513405](http://dx.doi.org/10.1109/TPAMI.2015.2513405)\.
- Zhang et al\. \(2025\)Zhang, Y\., Li, M\., Long, D\., Zhang, X\., Lin, H\., Yang, B\., Xie, P\., Yang, A\., Liu, D\., Lin, J\., Huang, F\., and Zhou, J\.Qwen3 embedding: Advancing text embedding and reranking through foundation models\.*arXiv preprint arXiv:2506\.05176*, 2025\.
- Zhao et al\. \(2025\)Zhao, X\., Hu, X\., Shan, Z\., Huang, S\., Zhou, Y\., Zhang, X\., Sun, Z\., Liu, Z\., Li, D\., Wei, X\., Pan, Y\., Xiang, Y\., Zhang, M\., Wang, H\., Yu, J\., Hu, B\., and Zhang, M\.Kalm\-embedding\-v2: Superior training techniques and data inspire a versatile embedding model, 2025\.URL[https://arxiv\.org/abs/2506\.20923](https://arxiv.org/abs/2506.20923)\.
- Zimmermann et al\. \(2021\)Zimmermann, R\. S\., Sharma, Y\., Schneider, S\., Bethge, M\., and Brendel, W\.Contrastive learning inverts the data generating process\.In*ICML*, 2021\.

## Appendix A Proofs and Additional Details of Section [2](https://arxiv.org/html/2605.31100#S2)

### A.1 Localizing the alignment term $\varphi_{\mathrm{align}}$ in $\mathcal{L}_{\lambda}(f)$

Fix an anchor point $x\in\mathcal{M}$. Let $\exp_{x}:T_{x}\mathcal{M}\to\mathcal{M}$ denote the Riemannian exponential map. For any positive view $x^{+}$ satisfying (A2), let $v:=v(x,x^{+})\in T_{x}\mathcal{M}$ be the geodesic displacement, so $x^{+}=\exp_{x}(v)$ and $\|v\|=d_{\mathcal{M}}(x,x^{+})$. We write $O(\|v\|^{k})$ to denote a scalar remainder term $\epsilon_{k}(v)$ such that there exist constants $r_{0}>0$ and $C$ with $|\epsilon_{k}(v)|\leq C\|v\|^{k}$ for all $\|v\|\leq r_{0}$. We use $\|\cdot\|$ for the Euclidean norm on vectors and the corresponding induced operator norms on linear/bilinear maps, and $\langle\cdot,\cdot\rangle$ for the standard inner product on $\mathbb{R}^{K}$. For a differentiable map $h$ between finite-dimensional vector spaces, $Dh(u)$ denotes its differential (Jacobian as a linear map) at $u$.

Lemma A: Under assumptions (A1)–(A2),

$$\|f(x)-f(x^{+})\|^{2}\;=\;v^{\top}G_{f}(x)\,v\;+\;O(\|v\|^{3}).\qquad\Box$$

Proof: Fix $x\in\mathcal{M}$. Define $g(u):=f(\exp_{x}(u))$ for $u$ in a neighborhood of $0\in T_{x}\mathcal{M}$. By (A1) and the smoothness of $\exp_{x}$, the map $g$ is $C^{2}$ in a neighborhood of $0$. Hence, by the multivariate Taylor theorem, there exist constants $r_{0}>0$ and $C$ such that for all $u\in T_{x}\mathcal{M}$ with $\|u\|\leq r_{0}$,

$$g(u)=g(0)+Dg(0)\,u+\rho(u),\qquad\|\rho(u)\|\leq C\|u\|^{2}.$$

By the chain rule, $Dg(0)=Df(x)\circ D(\exp_{x})(0)$. Since $D(\exp_{x})(0)=\mathrm{Id}_{T_{x}\mathcal{M}}$ (*e.g.,* (Lee, [2003](https://arxiv.org/html/2605.31100#bib.bib22))), we have $Dg(0)=Df(x)$. In an orthonormal basis of $T_{x}\mathcal{M}$, $Df(x)$ is represented by $J_{f}(x)$ and $G_{f}(x)=J_{f}(x)^{\top}J_{f}(x)$. Let $v=v(x,x^{+})$ and assume $\|v\|\leq r_{0}$. Since $x^{+}=\exp_{x}(v)$, we have $f(x^{+})-f(x)=g(v)-g(0)=J_{f}(x)\,v+\rho(v)$ with $\|\rho(v)\|\leq C\|v\|^{2}$. Therefore,

$$\|f(x^{+})-f(x)\|^{2}=\|J_{f}(x)v+\rho(v)\|^{2}=\|J_{f}(x)v\|^{2}+2\langle J_{f}(x)v,\rho(v)\rangle+\|\rho(v)\|^{2}.$$

The cross term satisfies

$$\big|2\langle J_{f}(x)v,\rho(v)\rangle\big|\leq 2\|J_{f}(x)v\|\,\|\rho(v)\|\leq 2\|J_{f}(x)\|\,\|v\|\cdot C\|v\|^{2}=O(\|v\|^{3}),$$

and the last term satisfies $\|\rho(v)\|^{2}\leq C^{2}\|v\|^{4}=O(\|v\|^{3})$ as $\|v\|\to 0$. Hence

$$\|f(x^{+})-f(x)\|^{2}=\|J_{f}(x)v\|^{2}+O(\|v\|^{3})=v^{\top}J_{f}(x)^{\top}J_{f}(x)v+O(\|v\|^{3})=v^{\top}G_{f}(x)\,v+O(\|v\|^{3}),$$

which proves the lemma. $\Box$
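The cubic remainder in Lemma A can be checked numerically. The sketch below is only an illustration under simplifying assumptions that are not from the paper: the manifold is flat ($\mathcal{M}=\mathbb{R}^{d}$, so $\exp_{x}(v)=x+v$), and the encoder is a hypothetical smooth map $f(x)=\tanh(Wx)$ whose Jacobian is available in closed form. It estimates the order of the gap between $\|f(x)-f(x+v)\|^{2}$ and $v^{\top}G_{f}(x)v$ as $\|v\|$ shrinks.

```python
import numpy as np

def encoder(x, W):
    """Hypothetical smooth encoder f(x) = tanh(W x); a stand-in for a C^2 map."""
    return np.tanh(W @ x)

def induced_metric(x, W):
    """G_f(x) = J_f(x)^T J_f(x), using the closed-form Jacobian of tanh(W x)."""
    J = (1.0 - np.tanh(W @ x) ** 2)[:, None] * W
    return J.T @ J

def lemma_gap(x, v, W):
    """| ||f(x + v) - f(x)||^2 - v^T G_f(x) v |, the remainder bounded in Lemma A."""
    lhs = np.sum((encoder(x + v, W) - encoder(x, W)) ** 2)
    rhs = v @ induced_metric(x, W) @ v
    return abs(lhs - rhs)

rng = np.random.default_rng(0)
d, K = 4, 6                      # illustrative input/output dimensions
W = rng.normal(size=(K, d))
x = rng.normal(size=d)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

radii = np.array([1e-2, 5e-3, 2.5e-3, 1.25e-3])
gaps = np.array([lemma_gap(x, r * direction, W) for r in radii])

# Slope of log(gap) against log(radius) estimates the order of the remainder.
order = np.polyfit(np.log(radii), np.log(gaps), 1)[0]
print("estimated remainder order:", order)
```

If the lemma holds for this toy map, the estimated order should be about 3 (or higher when the cubic coefficient happens to vanish along the chosen direction).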

This immediately gives us a rewriting of the alignment term in the localized surrogate used in Section [2.2](https://arxiv.org/html/2605.31100#S2.SS2).

Corollary B: Under (A1)–(A3),

$$\mathbb{E}[\|f(x)-f(x^{+})\|^{2}\mid x]=c\cdot\mathrm{tr}(G_{f}(x))+O\big(\mathbb{E}[\|v\|^{3}\mid x]\big).\qquad\Box$$

Proof: By Lemma [A.1](https://arxiv.org/html/2605.31100#A1.SS1), there exist $r_{0}>0$, $C<\infty$, and a remainder $\epsilon(v)$ such that for all $\|v\|\leq r_{0}$,

$$\|f(x)-f(x^{+})\|^{2}=v^{\top}G_{f}(x)v+\epsilon(v),\qquad|\epsilon(v)|\leq C\|v\|^{3}.$$

Taking conditional expectation given $x$ and using linearity,

$$\mathbb{E}[\|f(x)-f(x^{+})\|^{2}\mid x]=\mathbb{E}[v^{\top}G_{f}(x)v\mid x]+\mathbb{E}[\epsilon(v)\mid x].$$

Moreover,

$$|\mathbb{E}[\epsilon(v)\mid x]|\leq\mathbb{E}[|\epsilon(v)|\mid x]\leq C\,\mathbb{E}[\|v\|^{3}\mid x],$$

so $\mathbb{E}[\epsilon(v)\mid x]=O(\mathbb{E}[\|v\|^{3}\mid x])$.

For the quadratic form, note $v^{\top}G_{f}(x)v=\mathrm{tr}(G_{f}(x)vv^{\top})$, hence

$$\mathbb{E}[v^{\top}G_{f}(x)v\mid x]=\mathrm{tr}\!\left(G_{f}(x)\,\mathbb{E}[vv^{\top}\mid x]\right).$$

Under (A3), $\mathbb{E}[vv^{\top}\mid x]=cI_{d}$, so $\mathbb{E}[v^{\top}G_{f}(x)v\mid x]=c\cdot\mathrm{tr}(G_{f}(x))$. Combining the above yields the claim. $\Box$
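The trace identity at the heart of Corollary B, $\mathbb{E}[v^{\top}Gv]=c\,\mathrm{tr}(G)$ when $\mathbb{E}[vv^{\top}]=cI_{d}$, can be illustrated with a quick Monte Carlo estimate. This is a minimal sketch: the Gaussian displacement distribution, the matrix `G`, and the values of `d`, `c`, and `n` are arbitrary choices for illustration, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, c, n = 5, 0.03, 200_000  # illustrative dimension, isotropy constant, sample count

# Arbitrary symmetric positive-definite matrix standing in for G_f(x).
A = rng.normal(size=(d, d))
G = A.T @ A + np.eye(d)

# Isotropic displacements: zero mean and E[v v^T] = c * I_d, as in (A3).
v = rng.normal(scale=np.sqrt(c), size=(n, d))

# Quadratic form v^T G v for every sample, then its empirical mean.
quad_forms = np.einsum("ni,ij,nj->n", v, G, v)
mc_estimate = quad_forms.mean()
closed_form = c * np.trace(G)

print("Monte Carlo:", mc_estimate, " c * tr(G):", closed_form)
```

The two printed numbers should agree up to sampling noise. Any other zero-mean distribution with the same second moment would work as well, since only $\mathbb{E}[vv^{\top}]$ enters the identity.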

We will also use a variant of Lemma [A.1](https://arxiv.org/html/2605.31100#A1.SS1) stated as follows:

Corollary C: Under (A1), for $y$ in a sufficiently small normal neighborhood of $x$ and $v=\exp_{x}^{-1}(y)$, we have

$$\|f(y)-f(x)\|=\sqrt{v^{\top}G_{f}(x)v}+O(\|v\|^{2}).\qquad\Box$$

Proof: The proof of Lemma [A.1](https://arxiv.org/html/2605.31100#A1.SS1) is purely local and uses only that the second point lies in a normal neighborhood of $x$. Hence the same argument applies with $x^{+}$ replaced by $y$:

$$\|f(y)-f(x)\|^{2}=v^{\top}G_{f}(x)v+O(\|v\|^{3}).$$

Write $a(v):=v^{\top}G_{f}(x)v$ and let $r(v)=O(\|v\|^{3})$ denote the scalar remainder, so that $\|f(y)-f(x)\|^{2}=a(v)+r(v)$. Since $G_{f}(x)\succ 0$ by (A1), there exists $m>0$ such that $a(v)\geq m\|v\|^{2}$ for all $v$. For sufficiently small $\|v\|$, $a(v)+r(v)\geq\tfrac{m}{2}\|v\|^{2}$. Hence, for some constants $C,c^{\prime}>0$,

$$\big|\sqrt{a(v)+r(v)}-\sqrt{a(v)}\big|=\frac{|r(v)|}{\sqrt{a(v)+r(v)}+\sqrt{a(v)}}\leq\frac{C\|v\|^{3}}{c^{\prime}\|v\|}=O(\|v\|^{2}).$$

This completes the proof of Corollary [A.1](https://arxiv.org/html/2605.31100#A1.SS1). $\Box$

### A.2 Proof of Theorem [2.2](https://arxiv.org/html/2605.31100#S2.SS2)

We give a full proof of Theorem [2.2](https://arxiv.org/html/2605.31100#S2.SS2). The intuition is that, for a point $y$ near $x$, the encoder admits a first-order Taylor approximation along the unique short geodesic from $x$ to $y$. The leading term is governed by the Jacobian $J_{f}(x)$, hence by the induced metric $G_{f}(x)=J_{f}(x)^{\top}J_{f}(x)$. Local encoder optimality forces $G_{f}(x)$ to be a scalar multiple of the identity, so local distances agree across encoders up to a scalar factor.

Consider any $x\in\mathcal{M}$ and let $y\in\mathcal{M}$ satisfy $d_{\mathcal{M}}(x,y)<\delta_{\mathcal{M}}(x)$. Let $\exp_{x}:T_{x}\mathcal{M}\to\mathcal{M}$ denote the Riemannian exponential map. Since $y$ lies within the normal neighborhood of $x$, there exists a unique $v\in T_{x}\mathcal{M}$ such that $y=\exp_{x}(v)$ and $\|v\|=d_{\mathcal{M}}(x,y)$ (Lee, [2003](https://arxiv.org/html/2605.31100#bib.bib22)). Throughout, all $\mathcal{O}(\cdot)$ terms are as $y\to x$ (equivalently $\|v\|\to 0$).

Step 1: Local metric minimization. For $i\in\{1,2\}$, local optimality at $x$ means that $G_{f_{i}}(x)$ minimizes

$$\Phi_{i}(G):=c\,\mathrm{tr}(G)-\frac{\lambda_{i}}{2}\log\det(G)\qquad\text{over }G\succ 0.$$

Note that $-\log\det(G)$ is strictly convex on $G\succ 0$, so $\Phi_{i}$ is strictly convex (for $\lambda_{i}>0$) and any stationary point is its unique minimizer. For any symmetric direction $H$, since $\frac{d}{dt}\mathrm{tr}(G+tH)\big|_{t=0}=\mathrm{tr}(H)$ and $\frac{d}{dt}\log\det(G+tH)\big|_{t=0}=\mathrm{tr}(G^{-1}H)$ (Jacobi's formula; *e.g.,* (Petersen et al., [2008](https://arxiv.org/html/2605.31100#bib.bib30))), we have

$$\frac{d}{dt}\Phi_{i}(G+tH)\Big|_{t=0}=\mathrm{tr}\!\left(\Big[cI_{d}-\frac{\lambda_{i}}{2}G^{-1}\Big]H\right).$$

At the minimizer this derivative is $0$ for all symmetric $H$. Since $\mathrm{tr}(AH)=0$ for all symmetric $H$ implies $A=0$ for symmetric $A$ (take $H=A$), we obtain $cI_{d}-\frac{\lambda_{i}}{2}G^{-1}=0$. Therefore,

$$G_{f_{i}}(x)=\frac{\lambda_{i}}{2c}I_{d},\qquad i\in\{1,2\}.$$
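The closed-form minimizer $G=\frac{\lambda}{2c}I_{d}$ of $\Phi(G)=c\,\mathrm{tr}(G)-\frac{\lambda}{2}\log\det(G)$ can be sanity-checked numerically. The following is a minimal sketch with illustrative values of `d`, `c`, and `lam` (none taken from the paper): it verifies that the first-order condition vanishes at the claimed minimizer and that random symmetric perturbations only increase the objective.

```python
import numpy as np

def phi(G, c, lam):
    """Phi(G) = c * tr(G) - (lam / 2) * log det(G) for symmetric positive-definite G."""
    sign, logdet = np.linalg.slogdet(G)
    assert sign > 0, "G must be positive definite"
    return c * np.trace(G) - 0.5 * lam * logdet

def phi_gradient(G, c, lam):
    """Matrix form of the first-order condition: c * I - (lam / 2) * G^{-1}."""
    return c * np.eye(G.shape[0]) - 0.5 * lam * np.linalg.inv(G)

d, c, lam = 4, 0.05, 0.8               # illustrative values only
G_star = lam / (2.0 * c) * np.eye(d)   # claimed minimizer (lam / (2c)) * I_d

rng = np.random.default_rng(2)
values_nearby = []
for _ in range(100):
    S = rng.normal(size=(d, d))
    H = 0.1 * (S + S.T)                # random symmetric perturbation
    G = G_star + H
    if np.all(np.linalg.eigvalsh(G) > 0):  # stay inside the cone G > 0
        values_nearby.append(phi(G, c, lam))

print("gradient norm at G*:", np.linalg.norm(phi_gradient(G_star, c, lam)))
print("Phi(G*):", phi(G_star, c, lam), " min over perturbations:", min(values_nearby))
```

Because $\Phi$ is strictly convex, every perturbed value should exceed $\Phi(G^{\star})$, and the gradient norm at $G^{\star}$ should be zero up to floating-point error.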
Step 2: Local distance expansion via the induced metric. By Corollary [A.1](https://arxiv.org/html/2605.31100#A1.SS1), for $i\in\{1,2\}$,

$$\|f_{i}(y)-f_{i}(x)\|=\|J_{f_{i}}(x)v\|+\mathcal{O}(\|v\|^{2})=\sqrt{v^{\top}G_{f_{i}}(x)v}+\mathcal{O}(\|v\|^{2}).$$

Substituting $G_{f_{i}}(x)=\frac{\lambda_{i}}{2c}I_{d}$ (from Step 1) and $\|v\|=d_{\mathcal{M}}(x,y)$, we have

$$\|f_{i}(y)-f_{i}(x)\|=\sqrt{\frac{\lambda_{i}}{2c}}\,d_{\mathcal{M}}(x,y)+\mathcal{O}(d_{\mathcal{M}}(x,y)^{2}).$$
Step 3: Compare encoders. Let $\kappa:=\sqrt{\lambda_{1}/\lambda_{2}}$. Multiplying the $i=2$ expansion by $\kappa$ and using $\kappa\sqrt{\lambda_{2}/(2c)}=\sqrt{\lambda_{1}/(2c)}$ gives

$$\kappa\,\|f_{2}(y)-f_{2}(x)\|=\sqrt{\frac{\lambda_{1}}{2c}}\,d_{\mathcal{M}}(x,y)+\mathcal{O}(d_{\mathcal{M}}(x,y)^{2}).$$

Thus, by comparing with the $i=1$ expansion, we have

$$\|f_{1}(y)-f_{1}(x)\|=\kappa\,\|f_{2}(y)-f_{2}(x)\|+\mathcal{O}(d_{\mathcal{M}}(x,y)^{2}),$$

which is equivalent to the stated form $\|f_{1}(x)-f_{1}(y)\|=\kappa\,\|f_{2}(x)-f_{2}(y)\|+\mathcal{O}(d_{\mathcal{M}}(x,y)^{2})$.
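The conclusion of the proof can be illustrated with two toy encoders. This is a sketch under assumptions that are not from the paper: the manifold is flat $\mathbb{R}^{3}$, and each hypothetical encoder scales its input by $\sqrt{\lambda_{i}/(2c)}$ and appends one extra coordinate $\varepsilon_{i}\|y-x_{0}\|^{2}$, which has zero derivative at the anchor $x_{0}$. The induced metric at $x_{0}$ is therefore $\frac{\lambda_{i}}{2c}I$, while distances far from $x_{0}$ are distorted differently by each encoder.

```python
import numpy as np

def make_encoder(lam, c, eps, x0):
    """Toy encoder whose induced metric at x0 is (lam / (2c)) * I.

    The extra output coordinate eps * ||y - x0||^2 has zero derivative at x0,
    so it only distorts distances away from the anchor.
    """
    scale = np.sqrt(lam / (2.0 * c))

    def f(y):
        bend = eps * np.sum((y - x0) ** 2)
        return scale * np.append(y, bend)

    return f

c, lam1, lam2 = 0.05, 0.9, 0.4          # illustrative values only
x0 = np.zeros(3)
f1 = make_encoder(lam1, c, eps=2.0, x0=x0)   # strongly distorted at long range
f2 = make_encoder(lam2, c, eps=0.1, x0=x0)   # mildly distorted at long range
kappa = np.sqrt(lam1 / lam2)

direction = np.array([1.0, 2.0, -2.0]) / 3.0  # unit vector
for radius in (1.0, 0.1, 0.01):
    y = x0 + radius * direction
    d1 = np.linalg.norm(f1(y) - f1(x0))
    d2 = np.linalg.norm(f2(y) - f2(x0))
    print(f"radius={radius:5.2f}  d1/d2={d1 / d2:.4f}  kappa={kappa:.4f}")
```

As the radius shrinks, the ratio of encoder distances should approach $\kappa$, whereas at the largest radius the model-specific distortion dominates and the ratio is far from $\kappa$. This mirrors the short-range versus long-range behaviour discussed in the paper, in a deliberately simplified setting.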

### A.3 Relaxing (A3 $\to$ A3′): Point-Dependent Local Augmentation Assumption

Assumption (A3) in Section [2](https://arxiv.org/html/2605.31100#S2) simplifies the positive-pair distribution by requiring $\mathbb{E}[vv^{\top}\mid x]=cI_{d}$ with a single constant $c$ shared by all $x\in\mathcal{M}$. In practice, the magnitude of a semantics-preserving augmentation may also depend on the anchor point (*e.g.,* some examples admit larger perturbations than others). A natural relaxation is to allow the isotropic scale to vary with $x$.

A relaxed local-isotropy model (A3′). We replace (A3) of Section [2.2](https://arxiv.org/html/2605.31100#S2.SS2) by the following point-dependent variant, denoted by (A3′): Fix $x\in\mathcal{M}$ and let $v=v(x,x^{+})\in T_{x}\mathcal{M}$ denote the geodesic displacement to a positive view. Assume $\mathbb{E}[v\mid x]=0$ and $\mathbb{E}[vv^{\top}\mid x]=c(x)\,I_{d}$ for some function $c(\cdot):\mathcal{M}\to(0,\infty)$.

Effect on the localized alignment term $\varphi_{\mathrm{align}}$. By Lemma [A.1](https://arxiv.org/html/2605.31100#A1.SS1) and taking conditional expectation, we have

$$\mathbb{E}[\|f(x)-f(x^{+})\|^{2}\mid x]=\mathbb{E}[v^{\top}G_{f}(x)v\mid x]+O\big(\mathbb{E}[\|v\|^{3}\mid x]\big).$$

Using $v^{\top}G_{f}(x)v=\mathrm{tr}(G_{f}(x)vv^{\top})$ yields

$$\mathbb{E}[\|f(x)-f(x^{+})\|^{2}\mid x]=\mathrm{tr}\!\big(G_{f}(x)\,\mathbb{E}[vv^{\top}\mid x]\big)+O\big(\mathbb{E}[\|v\|^{3}\mid x]\big).$$

Under (A3′), $\mathbb{E}[vv^{\top}\mid x]=c(x)I_{d}$, hence

$$\mathbb{E}[\|f(x)-f(x^{+})\|^{2}\mid x]=c(x)\,\mathrm{tr}(G_{f}(x))+O\big(\mathbb{E}[\|v\|^{3}\mid x]\big).$$

Therefore the leading-order localized objective becomes

$$\widetilde{\mathcal{L}}_{\lambda}(x;f)=c(x)\,\mathrm{tr}(G_{f}(x))-\frac{\lambda}{2}\log\det(G_{f}(x)).$$
Local optimum. Since the leading-order local objective depends on $f$ only through $G_{f}(x)$, we minimize over $G\succ 0$ to characterize the optimal local metric; we then call $f$ locally optimal at $x$ if its induced metric matches this minimizer. Minimizing over $G$ yields the first-order condition

$$c(x)\,I_{d}-\frac{\lambda}{2}G^{-1}=0,$$

so the unique minimizer is $G_{f}^{\star}(x)=\frac{\lambda}{2c(x)}\,I_{d}$.

Thus the encoder remains locally a scaled isometry on $T_{x}\mathcal{M}$, but the scale factor depends on $x$ through $c(x)$.

Implication for Theorem [2.2](https://arxiv.org/html/2605.31100#S2.SS2). If two encoders $f_{1},f_{2}$ satisfy the same analysis but potentially with different augmentation scales $c_{1}(x),c_{2}(x)$ and parameters $\lambda_{1},\lambda_{2}$, then the local distance expansions become (for $i\in\{1,2\}$):

$$\|f_{i}(x)-f_{i}(y)\|=\sqrt{\frac{\lambda_{i}}{2c_{i}(x)}}\,d_{\mathcal{M}}(x,y)+O\big(d_{\mathcal{M}}(x,y)^{2}\big)$$

for $y$ in the normal neighborhood of $x$. Eliminating $d_{\mathcal{M}}(x,y)$ gives a point-dependent scaling relation

$$\|f_{1}(x)-f_{1}(y)\|=\kappa(x)\,\|f_{2}(x)-f_{2}(y)\|+O\big(d_{\mathcal{M}}(x,y)^{2}\big)$$

with $\kappa(x)=\sqrt{\frac{\lambda_{1}\,c_{2}(x)}{\lambda_{2}\,c_{1}(x)}}$. In particular, if the two encoders share the same local positive-pair distribution in the sense that $c_{1}(x)=c_{2}(x)$ for all $x$, then $\kappa(x)$ reduces to the constant $\sqrt{\lambda_{1}/\lambda_{2}}$ as stated in Theorem [2.2](https://arxiv.org/html/2605.31100#S2.SS2).
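As a small worked example of the point-dependent scale, the helper functions below evaluate $\kappa(x)$ from the formula above and confirm that it equals the ratio of the two local distance scales $\sqrt{\lambda_{i}/(2c_{i}(x))}$. The function names and numeric values are hypothetical and chosen only for illustration.

```python
import math

def local_scale(lam, c_x):
    """sqrt(lam / (2 c(x))): local distance scale of one encoder at the point x."""
    return math.sqrt(lam / (2.0 * c_x))

def kappa(lam1, c1_x, lam2, c2_x):
    """Point-dependent ratio kappa(x) = sqrt(lam1 * c2(x) / (lam2 * c1(x)))."""
    return math.sqrt((lam1 * c2_x) / (lam2 * c1_x))

# Illustrative values: encoder 1 sees smaller augmentations at x than encoder 2.
lam1, lam2 = 0.9, 0.4
c1_x, c2_x = 0.02, 0.08

k = kappa(lam1, c1_x, lam2, c2_x)
ratio_of_scales = local_scale(lam1, c1_x) / local_scale(lam2, c2_x)
print("kappa(x):", k, " ratio of local scales:", ratio_of_scales)

# With a shared augmentation scale, kappa no longer depends on x.
print("shared c:", kappa(lam1, 0.05, lam2, 0.05), " sqrt(lam1/lam2):", math.sqrt(lam1 / lam2))
```

With these numbers, $\kappa(x)=\sqrt{0.072/0.008}=3$, whereas a shared augmentation scale gives the constant $\sqrt{0.9/0.4}=1.5$.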

Remark. Allowing $c=c(x)$ provides a simple mechanism for why a *single global* scale does not fit all points: even when encoders are locally conformal, they may “expand” or “contract” neighborhoods by different amounts at different anchors. This complements the empirical observation that short-range distances are substantially more consistent than long-range distances, while also explaining residual variability within the short-range regime.

### A.4 Correlation Analysis

We report the Pearson correlation coefficient $\rho$ between pairwise Euclidean distances measured in the reference embedding space and the corresponding distances between the same item pairs in the target space. Besides the contrastive encoder pairs, we also include the classic word embedding models GloVe (Pennington et al., [2014](https://arxiv.org/html/2605.31100#bib.bib29)) and fastText (Bojanowski et al., [2016](https://arxiv.org/html/2605.31100#bib.bib4)), which are trained without a contrastive objective, as non-contrastive baselines. The same two models are reused as the non-contrastive baseline in the top-$k$ Jaccard analysis of §[A.5](https://arxiv.org/html/2605.31100#A1.SS5).

We probe the local-consistency / global-decorrelation pattern of Fig. [1](https://arxiv.org/html/2605.31100#S2.F1) along four axes: (i) encoder pair: six contrastive encoder pairs (Fig. [7](https://arxiv.org/html/2605.31100#A1.F7), panels a–f); (ii) target dimensionality: with Mistral as the reference, varying the target embedding dimensionality of OpenAI on SciFact (Fig. [7](https://arxiv.org/html/2605.31100#A1.F7), panel g); (iii) task domain: Mistral $\rightarrow$ OpenAI on two clustering benchmarks (Fig. [7](https://arxiv.org/html/2605.31100#A1.F7), panel h); (iv) encoder family: contrastive vs. non-contrastive on four retrieval datasets (Fig. [7](https://arxiv.org/html/2605.31100#A1.F7), panels i–l), where each panel overlays GloVe $\leftrightarrow$ fastText with Mistral $\leftrightarrow$ OpenAI. For (i)–(iii), $\rho$ is high at short range and decays as the reference distance grows, indicating that the geometric signal underlying GEH is a property of the contrastive encoder family. In contrast, the non-contrastive pair in (iv) already starts at markedly lower $\rho$ at short range, and its long-range tail does not always decay. This supports our restriction of GEH's analysis to contrastive encoders (cf. §[2.1](https://arxiv.org/html/2605.31100#S2.SS1)).

![Refer to caption](https://arxiv.org/html/2605.31100v1/x7.png)aGTE→\\rightarrowMistral
![Refer to caption](https://arxiv.org/html/2605.31100v1/x8.png)bGTE→\\rightarrowOpenAI
![Refer to caption](https://arxiv.org/html/2605.31100v1/x9.png)cKaLM→\\rightarrowMistral
![Refer to caption](https://arxiv.org/html/2605.31100v1/x10.png)dKaLM→\\rightarrowQwen
![Refer to caption](https://arxiv.org/html/2605.31100v1/x11.png)eQwen→\\rightarrowOpenAI
![Refer to caption](https://arxiv.org/html/2605.31100v1/x12.png)fKaLM→\\rightarrowOpenAI
![Refer to caption](https://arxiv.org/html/2605.31100v1/x13.png)gOpenAIdim sweep,SciFact
![Refer to caption](https://arxiv.org/html/2605.31100v1/x14.png)hClustering tasks
![Refer to caption](https://arxiv.org/html/2605.31100v1/x15.png)iArguAna
![Refer to caption](https://arxiv.org/html/2605.31100v1/x16.png)jFiQA
![Refer to caption](https://arxiv.org/html/2605.31100v1/x17.png)kNFCorpus
![Refer to caption](https://arxiv.org/html/2605.31100v1/x18.png)lSciDocs

Figure 7:Distance consistency across embedding spaces\.Each subplot shows Pearson correlationρ\\rhobetween pairwise distances in the reference space and their counterparts in the target space, binned by the reference distance\. \(a–f\) Six contrastive encoder pairs\.\(g\)Mistral→\\rightarrowOpenAIonSciFactfor a sweep ofOpenAIdimensionalities\. \(h\)Mistral→\\rightarrowOpenAIon two clustering benchmarks\. \(i–l\) Non\-contrastive comparison: each panel overlaysGloVe↔\\leftrightarrowfastTextwithMistral↔\\leftrightarrowOpenAIon the same dataset\.
### A\.5Retrieval Result analysis

To empirically validate the local geometric consistency of embedding spaces, we analyze the consistency of top-$k$ retrieval results across different embedding models. Given two embedding sets $\mathbf{E}_1$ and $\mathbf{E}_2$ encoding the same corpus, we perform top-$k$ nearest neighbor retrieval for each query point in both spaces and measure their agreement using the Jaccard index: $J_k=|\mathcal{N}_k^{(1)}\cap\mathcal{N}_k^{(2)}|/|\mathcal{N}_k^{(1)}\cup\mathcal{N}_k^{(2)}|$, where $\mathcal{N}_k^{(i)}$ denotes the set of $k$ nearest neighbors in embedding space $i$. We evaluate this metric across multiple embedding model pairs.
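A minimal sketch of the per-query Jaccard computation is given below. It assumes cosine similarity for retrieval, queries drawn from the corpus itself (with the query excluded from its own neighbor list), and row-aligned matrices `E1` and `E2`; the function names are hypothetical.

```python
import numpy as np

def topk_neighbors(E, query_idx, k):
    """Indices of the k nearest neighbors (by cosine similarity) of row
    query_idx in E, excluding the query itself."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = En @ En[query_idx]
    sims[query_idx] = -np.inf  # never return the query as its own neighbor
    return set(np.argsort(-sims)[:k].tolist())

def jaccard_at_k(E1, E2, query_idx, k):
    """J_k = |N1 & N2| / |N1 | N2| for one query, where row i of E1 and
    row i of E2 embed the same object."""
    n1 = topk_neighbors(E1, query_idx, k)
    n2 = topk_neighbors(E2, query_idx, k)
    return len(n1 & n2) / len(n1 | n2)
```

Averaging `jaccard_at_k` over a sample of query indices, for a range of `k`, would reproduce the shape of the curves in Figure 8. Two spaces related by a rotation preserve all cosine similarities and therefore give $J_k=1$.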

As shown in Figure[8](https://arxiv.org/html/2605.31100#A1.F8), the behavior splits sharply by encoder family\. For pairs of contrastive encoders \(panels a–c\), the Jaccard index starts atJ≈0\.7J\\\!\\approx\\\!0\.7–0\.80\.8atk=1k\{=\}1and decays monotonically to a plateau around0\.370\.37–0\.450\.45atk=50k\{=\}50, empirically confirming Theorem[2\.2](https://arxiv.org/html/2605.31100#S2.SS2)\. In contrast, when at least one side is a non\-contrastive encoder \(panels d–f\), the Jaccard index never enters this short\-range high\-consistency regime:GloVe↔\\leftrightarrowfastText\(panel d\) stays nearJ≈0\.1J\\\!\\approx\\\!0\.1–0\.170\.17over the entire range ofkk, and aMistral/OpenAIquery side that yieldsJ≈0\.8J\\\!\\approx\\\!0\.8against a contrastive target collapses toJ≈0\.2J\\\!\\approx\\\!0\.2against a non\-contrastive target \(panels e–f\)\.

![Refer to caption](https://arxiv.org/html/2605.31100v1/)aKaLM↔\\leftrightarrowGTE
![Refer to caption](https://arxiv.org/html/2605.31100v1/x20.png)bQwen↔\\leftrightarrowGTE
![Refer to caption](https://arxiv.org/html/2605.31100v1/x21.png)cQwen↔\\leftrightarrowKaLM
![Refer to caption](https://arxiv.org/html/2605.31100v1/x22.png)dGloVe↔\\leftrightarrowfastText
![Refer to caption](https://arxiv.org/html/2605.31100v1/x23.png)eMistral↔\\leftrightarrowOpenAIandGloVe
![Refer to caption](https://arxiv.org/html/2605.31100v1/x24.png)fOpenAI↔\\leftrightarrowMistralandfastText

Figure 8: Cross-embedding retrieval consistency analysis on SciFact. Each panel reports the mean ± 1 std of the per-query Jaccard index between top-$k$ retrieval results from two embedding spaces over 100 random queries.

## Appendix BDetails of Section[4](https://arxiv.org/html/2605.31100#S4)

### B\.1Why FPS for View Sampling \(Stability Analysis\)

We justify FPS view sampling for the kernelized distance-to-anchor hash used in the main text. In a nutshell, we show that a hash is stable if the anchor-induced distance coordinates provide diverse directional information, whereas clustered anchors yield redundant coordinates and amplify cross-model distance distortions.

Local stability via the tangent Jacobian. Since we employ cosine distance, we operate on the unit hypersphere $\mathbb{S}^{D-1}\subset\mathbb{R}^{D}$. The hashing function $\mathbf{h}_{\mathcal{A}}:\mathbb{S}^{D-1}\to\mathbb{R}^{m_t}$ can be defined by the kernel with cosine distance:

$$h_{\mathcal{A}}(w)_j=\exp\left(-\frac{1-\langle w,a'_j\rangle}{\sigma}\right),\quad j=1,\dots,m_t.$$

We differentiate $h_{\mathcal{A}}(w)_j$ with respect to $w$ and restrict the domain to the tangent space of the sphere, $T_v\mathbb{S}^{D-1}=\{z\in\mathbb{R}^{D}:\langle z,v\rangle=0\}$. The $j$-th row of the Jacobian $J_{\mathcal{A}}(v)\in\mathbb{R}^{m_t\times D}$ is given by the projection of the gradient onto $T_v$:

$$(J_{\mathcal{A}}(v))_{j,:}=\frac{1}{\sigma}h_j(v)\cdot\underbrace{(a'_j-\langle a'_j,v\rangle v)^{\top}}_{\mathbf{p}_j^{\top}}.$$

Here, $\mathbf{p}_j$ represents the component of the anchor $a'_j$ orthogonal to the query $v$.

We assume the data lies on a submanifold $\mathcal{M}\subset\mathbb{S}^{D-1}$ of intrinsic dimension $d\ll D$. For the hash to be stable (locally injective) on $\mathcal{M}$, the mapping must distinguish perturbations in any tangent direction. Let $\mathbf{\Pi}_v\in\mathbb{R}^{D\times d}$ be an orthonormal basis for the tangent space $T_v\mathcal{M}$. The restricted Jacobian $\mathbf{J}_{\mathcal{M}}(v)\in\mathbb{R}^{m_t\times d}$ is defined as:

$$\mathbf{J}_{\mathcal{M}}(v)=J_{\mathcal{A}}(v)\cdot\mathbf{\Pi}_v.$$

A necessary condition for stability is that $\mathbf{J}_{\mathcal{M}}(v)$ has full column rank, which requires $m_t\geq d$. The robustness of this stability is quantified by the condition number $\kappa=\sigma_{\max}/\sigma_{\min}$ of $\mathbf{J}_{\mathcal{M}}(v)$. If $\kappa$ is large ($\sigma_{\min}\approx 0$), the hash is insensitive to changes along the corresponding singular vector, leading to ambiguity.
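The conditioning argument can be checked numerically on a toy example. The sketch below is illustrative only: it assumes $D=3$ with $\mathcal{M}$ equal to the whole sphere (so $d=2$), a hand-picked tangent basis `Pi` at $v=e_3$, and $\sigma=0.5$; all function names are hypothetical.

```python
import numpy as np

def hash_vec(w, anchors, sigma):
    """h_A(w)_j = exp(-(1 - <w, a'_j>) / sigma); w and anchor rows are unit vectors."""
    return np.exp(-(1.0 - anchors @ w) / sigma)

def tangent_jacobian(v, anchors, sigma):
    """Row j is (1/sigma) * h_j(v) * p_j^T with p_j = a'_j - <a'_j, v> v."""
    h = hash_vec(v, anchors, sigma)
    P = anchors - np.outer(anchors @ v, v)
    return (h / sigma)[:, None] * P

def condition_number(v, anchors, sigma, Pi):
    """kappa = sigma_max / sigma_min of J_M(v) = J_A(v) @ Pi,
    where Pi (D x d) has orthonormal columns spanning T_v M."""
    s = np.linalg.svd(tangent_jacobian(v, anchors, sigma) @ Pi, compute_uv=False)
    return s[0] / s[-1]

def unit(x):
    """Normalize a vector to unit length."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x)

# Toy setting (assumption): query at the north pole, tangent plane = xy-plane.
v = np.array([0.0, 0.0, 1.0])
Pi = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
spread = np.stack([unit([1, 0, 1]), unit([0, 1, 1])])        # orthogonal projections
clustered = np.stack([unit([1, 0, 1]), unit([1, 0.01, 1])])  # nearly collinear projections
kappa_spread = condition_number(v, spread, 0.5, Pi)
kappa_clustered = condition_number(v, clustered, 0.5, Pi)
```

In this toy setting the two spread anchors sense orthogonal tangent directions with equal weight, whereas the clustered pair leaves the $y$-direction almost unobserved, so `kappa_clustered` is much larger than `kappa_spread`.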

Geometric Optimality of FPS. The singular values of $\mathbf{J}_{\mathcal{M}}(v)$ are determined by the geometric arrangement of the anchor projections $\{\mathbf{p}_j\}$. Random sampling can select multiple anchors that are locally redundant, making the manifold-tangent components $\{\mathbf{\Pi}_v^{\top}\mathbf{p}_j\}$ nearly collinear and leaving some tangent directions weakly sensed. This drives $\sigma_{\min}\to 0$. FPS greedily maximizes the minimum pairwise distance among the selected anchors, which discourages near-duplicates in each view and promotes a larger $\sigma_{\min}$. This reduces angular redundancy among $\{\mathbf{\Pi}_v^{\top}\mathbf{p}_j\}$ and thus improves the conditioning of $\mathbf{J}_{\mathcal{M}}(v)$ in practice.

Additionally, FPS acts as a covering strategy \(a greedykk\-center heuristic\): selected anchors are spread across the current anchor pool, increasing the chance that a sampled view contains anchors that arelocally relevantfor many different query points\.
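For reference, greedy farthest point sampling under cosine distance can be sketched in a few lines. This is a generic version of the technique, not the released implementation: the function name, the fixed `start` index, and the assumption that the pool holds at least `s` distinct points are ours.

```python
import numpy as np

def farthest_point_sampling(X, s, start=0):
    """Greedy k-center heuristic: repeatedly add the point whose minimum
    cosine distance to the already selected points is largest."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    selected = [start]
    min_dist = 1.0 - Xn @ Xn[start]  # distance of every point to the selected set
    for _ in range(s - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, 1.0 - Xn @ Xn[nxt])
    return selected
```

Each step costs one matrix-vector product, so selecting `s` anchors from a pool of `n` vectors of dimension `D` takes $O(snD)$ time. Varying `start` (for example, choosing it at random) is one simple way to obtain different views from the same pool.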

FPS thus reduces view-specific collisions and makes per-view hash-space matching more stable, which in turn improves the quality of votes aggregated by the bootstrapping procedure in Section [4](https://arxiv.org/html/2605.31100#S4). Consistent with this analysis, FPS significantly outperforms random anchor selection in our ablations (Section [5.3](https://arxiv.org/html/2605.31100#S5.SS3)).

### B\.2Properties of View Scheduling

This appendix expands on the scheduling rule used in Section[4](https://arxiv.org/html/2605.31100#S4)for choosing the number of viewsmtm\_\{t\}and the anchors per viewsts\_\{t\}as the paired\-anchor poolℒt−1\\mathcal\{L\}\_\{t\-1\}grows\. The schedule provides two practical properties/guarantees: \(i\) increasing view diversity over iterations, and \(ii\) stable per\-anchor coverage and computation\.

Objectives\.At iterationtt, letnt:=\|ℒt−1\|n\_\{t\}:=\|\\mathcal\{L\}\_\{t\-1\}\|denote the size of the current paired\-anchor pool\. A view containssts\_\{t\}paired anchors, and we samplemtm\_\{t\}views\. Define theper\-view anchor fractionρt:=st/nt\\rho\_\{t\}:=s\_\{t\}/n\_\{t\}\.

The view schedule decidesmtm\_\{t\}andρt\\rho\_\{t\}\(equivalently,sts\_\{t\}\), while aiming to satisfy two practical objectives:

\(O1\) Increasing diversity & locality\.Asntn\_\{t\}grows, we would like to sample more views \(so more chances to hit short\-range neighborhoods\) while making each view a smaller fraction of the poolℒt−1\\mathcal\{L\}\_\{t\-1\}\(so views are more “local” and less dominated by far anchors\)\.

\(O2\) Stable per\-anchor coverage and computation\.If views were sampled uniformly, a given anchor would appear in an expectedmt​ρtm\_\{t\}\\rho\_\{t\}views per iteration\. We would like this quantity to remain roughly stable over iterations \(so anchors are neither over\-used nor ignored\), and we would like the total anchor usagemt​stm\_\{t\}s\_\{t\}per iteration to scale reasonably\.

Anchor pool growth ratio $g_t$. We measure progress by the growth ratio $g_t:=\frac{|\mathcal{L}_{t-1}|}{\max\{|\mathcal{S}|,1\}}\geq 1$, where $\mathcal{S}$ is the (tiny) initial seed set. This ratio ensures that the schedule depends on the relative growth of the anchor pool rather than its absolute size, *e.g.,* growing from 10 seeds to 100 anchors and from 100 seeds to 1000 anchors both correspond to $g_t=10$.

Coverage-preserving parameterization. To satisfy (O2), we parameterize the schedule via a single scaling function $\mathrm{sf}(g)\geq 1$ chosen so that

$$m_t\approx m_0\,\mathrm{sf}(g_t),\qquad\rho_t\approx\rho_0/\mathrm{sf}(g_t),$$

where $m_0\in\mathbb{N}$ and $\rho_0\in(0,1]$ are base parameters. This schedule ensures the approximate invariance (ignoring rounding) $m_t\rho_t\approx m_0\rho_0$, *i.e.,* a constant expected per-anchor participation rate under uniform sampling.

This would also imply linear scaling of total anchor usage sincemt​st=mt​ρt​nt≈m0​ρ0​\|ℒt−1\|m\_\{t\}s\_\{t\}=m\_\{t\}\\rho\_\{t\}\\,n\_\{t\}\\approx m\_\{0\}\\rho\_\{0\}\\,\|\\mathcal\{L\}\_\{t\-1\}\|\. Thus it increases the number of views while controlling per\-iteration work\.

Determining $\mathrm{sf}(g)$. For convenience, we consider $\mathrm{sf}(g)=1+u(g)$, where $u(g)$ represents the increment induced by the growth of the anchor pool. We require $u(1)=0$, since $g=1$ corresponds to the initial state $|\mathcal{L}_{t-1}|=|\mathcal{S}|$ in which no growth has occurred.

To decide $u(g)$, we consider a simple principle: the increment in the schedule should depend only on the expansion rate of the anchor pool, not on the current scale $g$. Let $g$ be the current growth ratio *w.r.t.* $\mathcal{L}_{r-1}$ and let $g'$ denote the growth ratio after expansion to $\mathcal{L}_r$. Define the expansion ratio $\lambda:=g'/g$. The principle then says that if the anchor pool grows by a factor $\lambda\geq 1$, the increase in the schedule (size and number of views) should depend only on $\lambda$ and not on the current growth scale $g$. In terms of $u(g)=\mathrm{sf}(g)-1$, this boils down to $u(\lambda g)-u(g)=u(\lambda)$ for all $g,\lambda\geq 1$; equivalently,

$$u(\lambda g)=u(\lambda)+u(g)\qquad\forall g,\lambda\geq 1.$$

By the classical Cauchy additive functional equation (after a log change of variables), assuming $u$ is continuous, the only solutions are of the form $u(g)=c\log g$ for some constant $c$; the requirement $\mathrm{sf}(g)\geq 1$ then forces $c\geq 0$.
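For completeness, the log change of variables can be written out explicitly. The auxiliary function $v$ below is introduced only for this derivation and is not part of the paper's notation.

```latex
\begin{align*}
v(x) &:= u(e^{x}), \qquad x \ge 0, \\
v(x+y) &= u(e^{x}e^{y}) = u(e^{x}) + u(e^{y}) = v(x) + v(y), \qquad x, y \ge 0, \\
v(x) &= c\,x \quad \text{(continuous solutions of the additive equation)}, \\
u(g) &= v(\log g) = c \log g, \qquad g \ge 1 .
\end{align*}
```

The condition $u(1)=0$ holds automatically, and $\mathrm{sf}(g)=1+u(g)\geq 1$ for all $g\geq 1$ holds exactly when $c\geq 0$.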

Thus we adopt the logarithmic scale factor

$$\mathrm{sf}_t:=\mathrm{sf}(g_t)=1+c\log g_t.$$

This grows sublinearly and avoids exploding $T_t$ as the anchor pool expands, while still increasing view diversity. We empirically verify the robustness of view-scheduling hyperparameters $m_0$ and $\rho_0$ (see Appendix [C.2.4](https://arxiv.org/html/2605.31100#A3.SS2.SSS4)).
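The schedule can be sketched as a small helper. The function name `view_schedule`, the use of the natural logarithm, and the rounding and clamping conventions are assumptions; the text above specifies the schedule only up to rounding.

```python
import math

def view_schedule(pool_size, seed_size, m0, rho0, c):
    """Return (m_t, s_t): number of views and anchors per view for the
    current paired-anchor pool, using sf(g) = 1 + c * log(g)."""
    g = pool_size / max(seed_size, 1)   # growth ratio g_t >= 1
    sf = 1.0 + c * math.log(g)          # logarithmic scale factor
    m_t = max(1, round(m0 * sf))        # more views as the pool grows
    rho_t = rho0 / sf                   # each view covers a smaller fraction
    s_t = max(1, min(pool_size, round(rho_t * pool_size)))
    return m_t, s_t
```

By construction, `m_t * s_t / pool_size` stays close to `m0 * rho0` up to rounding, which is the constant per-anchor participation rate targeted by (O2), while `m_t` itself grows only logarithmically in the pool size.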

### B\.3Per\-view Link Proposal at Scale

When $\max(|\mathbf{E}_1|,|\mathbf{E}_2|)\leq\tau$ ($\tau=5\times 10^{5}$), GEH runs the global path of §[4](https://arxiv.org/html/2605.31100#S4). Otherwise, at iteration $t$, we replace FPS with a $k$-means partition of $\mathcal{L}_{t-1}$ in $E_1$ to ensure diversity of views. The number of partitions is $m_t=\lceil|\mathcal{L}_{t-1}|/d_{\max}\rceil$, where $d_{\max}=\max(d_{E_1},d_{E_2})$, which ensures each view has sufficient anchors for a dimensionally well-posed local neighborhood. Each anchor in $\mathcal{L}_{t-1}$ is included in its $\rho$ nearest clusters ($\rho=2$ in our experiments), which matches the constant per-anchor participation rate in §[B.2](https://arxiv.org/html/2605.31100#A2.SS2).
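A rough sketch of this partition-based view construction follows. It is not the released implementation: it uses a plain Lloyd's $k$-means written in NumPy, reads $d_{E_i}$ as the embedding dimensionality of $E_i$, and the names `num_partitions`, `kmeans_views`, and `n_assign` are hypothetical.

```python
import math
import numpy as np

def num_partitions(pool_size, d1, d2):
    """m_t = ceil(|L_{t-1}| / d_max) with d_max = max(d1, d2)."""
    return math.ceil(pool_size / max(d1, d2))

def kmeans_views(A, m, n_assign=2, n_iter=20, seed=0):
    """Cluster anchor vectors A (n x D, float) into m groups with Lloyd's
    algorithm, then place every anchor in its n_assign nearest clusters.
    Returns one list of anchor indices per view."""
    rng = np.random.default_rng(seed)
    centers = A[rng.choice(len(A), size=m, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((A[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        for j in range(m):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = A[labels == j].mean(0)
    d2 = ((A[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    nearest = np.argsort(d2, axis=1)[:, :min(n_assign, m)]
    return [np.flatnonzero((nearest == j).any(1)).tolist() for j in range(m)]
```

With `n_assign=2`, every anchor appears in exactly two views, so the average view holds about $2|\mathcal{L}_{t-1}|/m_t\approx 2d_{\max}$ anchors.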

Within each view𝒜t,k\\mathcal\{A\}\_\{t,k\}, each paired anchor\(a,a′\)∈𝒜t,k\(a,a^\{\\prime\}\)\\in\\mathcal\{A\}\_\{t,k\}contributes its own local neighborhood: letNNkNN​\(a,E1\)\\mathrm\{NN\}\_\{k\_\{\\mathrm\{NN\}\}\}\(a,E\_\{1\}\)denote thekNNk\_\{\\mathrm\{NN\}\}nearest ambient neighbors ofaainE1E\_\{1\}, and analogouslyNNkNN​\(a′,E2\)\\mathrm\{NN\}\_\{k\_\{\\mathrm\{NN\}\}\}\(a^\{\\prime\},E\_\{2\}\)\. The view’s local sets are the unionsS𝒜t,k\(1\):=⋃\(a,a′\)∈𝒜t,kNNkNN​\(a,E1\)S\_\{\\mathcal\{A\}\_\{t,k\}\}^\{\(1\)\}\\;:=\\;\\bigcup\_\{\(a,a^\{\\prime\}\)\\in\\mathcal\{A\}\_\{t,k\}\}\\mathrm\{NN\}\_\{k\_\{\\mathrm\{NN\}\}\}\(a,E\_\{1\}\)and analogouslyS𝒜t,k\(2\)S\_\{\\mathcal\{A\}\_\{t,k\}\}^\{\(2\)\}\. We then build the distance\-to\-anchor signatures of §[3](https://arxiv.org/html/2605.31100#S3)only for points inS𝒜t,k\(1\)∪S𝒜t,k\(2\)S\_\{\\mathcal\{A\}\_\{t,k\}\}^\{\(1\)\}\\cup S\_\{\\mathcal\{A\}\_\{t,k\}\}^\{\(2\)\}, compute𝖼𝗌𝗅𝗌𝒜t,k\{\\mathsf\{csls\}\}\_\{\\mathcal\{A\}\_\{t,k\}\}withinS𝒜t,k\(1\)×S𝒜t,k\(2\)S\_\{\\mathcal\{A\}\_\{t,k\}\}^\{\(1\)\}\\\!\\times\\\!S\_\{\\mathcal\{A\}\_\{t,k\}\}^\{\(2\)\}, and emit MNN proposals𝒫t,k\\mathcal\{P\}\_\{t,k\}within that local bipartite set\.

The local sets are identified via twokk\-NN indices overE1E\_\{1\}andE2E\_\{2\}separately, built once at the start of𝖦𝖤𝖧\\mathsf\{GEH\}\(costO​\(\|E1\|\+\|E2\|\)O\(\|E\_\{1\}\|\+\|E\_\{2\}\|\)\) and reused across allTTiterations and∑tmt\\sum\_\{t\}m\_\{t\}views\. Each per\-view local\-set lookup is then a singlekNNk\_\{\\mathrm\{NN\}\}\-NN query against this static index\.

## Appendix CDetails of Section[5](https://arxiv.org/html/2605.31100#S5)

### C\.1Experimental Setup

#### C\.1\.1Datasets

We conduct main experiments on five benchmark datasets from BEIR\(Thakur et al\.,[2021](https://arxiv.org/html/2605.31100#bib.bib35)\), covering biomedical retrieval, financial analysis, citation prediction, argument mining, and fact verification\. Below we briefly describe each dataset\.

SciFact\(Wadden et al\.,[2020](https://arxiv.org/html/2605.31100#bib.bib39)\)is a scientific fact\-checking benchmark\. Queries consist of short scientific claims, while documents are abstracts of scientific papers\. The goal is to retrieve supporting or refuting evidence for each claim\.

NFCorpus\(Boteva et al\.,[2016](https://arxiv.org/html/2605.31100#bib.bib5)\)focuses on health\-related information retrieval\. Queries come from user\-generated content such as blog posts, Q&A threads, and video transcripts, and the corpus is built from medical articles in PubMed\.

ArguAna\(Wachsmuth et al\.,[2018](https://arxiv.org/html/2605.31100#bib.bib38)\)addresses argument retrieval\. Given an argument as a query, the task is to identify the most relevant counterarguments from a collection of argument pairs mined from online debate portals\.

SciDocs\(Cohan et al\.,[2020](https://arxiv.org/html/2605.31100#bib.bib7)\)is a citation prediction dataset derived from scientific publications\. Queries are scientific papers, and the task is to retrieve related works among a large held\-out collection\.

FiQA\(Maia et al\.,[2018](https://arxiv.org/html/2605.31100#bib.bib25)\)comes from the financial domain\. The queries are investment\-related questions posted on StackExchange, while the corpus contains financial articles and answers from the same platform\.

FEVER (Thorne et al., [2018](https://arxiv.org/html/2605.31100#bib.bib36)) is a fact-verification dataset. The queries are short factual claims, and the corpus consists of the introductory sections of Wikipedia pages.

Table[5](https://arxiv.org/html/2605.31100#A3.T5)provides dataset statistics, including query counts, corpus sizes, and the average number of relevant documents per query\.

Table 5:BEIR dataset statistics:\#test queries \(QQ\), test corpus size \(\|C\|\|C\|\), and avg\. relevant docs/query \(RR\)\.
#### C\.1\.2Embedding Models

We generate embeddings using a mix of proprietary API services and open\-weight embedding models\.

Mistral. We use Mistral’s commercial Embeddings API with the mistral-embed model (Jiang et al., [2023](https://arxiv.org/html/2605.31100#bib.bib18)). Mistral’s model family is designed for efficient inference; we use the hosted embeddings endpoint as provided by Mistral.

OpenAI. We use OpenAI’s Embeddings API with text-embedding-3-small ([OpenAI,](https://arxiv.org/html/2605.31100#bib.bib27)). We embed queries and documents using the same embedding model and endpoint (i.e., no separate query/document encoders or prompt format is required by the API).

GTE. We use gte-Qwen2-7B-instruct (Li et al., [2023](https://arxiv.org/html/2605.31100#bib.bib23)), a 7B-parameter text embedding model in the General Text Embeddings (GTE) family, trained on top of the Qwen2-7B backbone. We follow the recommended usage: queries are encoded with a query-specific prompt, whereas documents are encoded without instructions.

Qwen. We use Qwen3-Embedding-8B (Zhang et al., [2025](https://arxiv.org/html/2605.31100#bib.bib47)), an instruction-aware embedding model built on the Qwen3 foundation models. We follow the recommended asymmetric encoding: queries are encoded with a query-specific prompt, while documents are encoded unchanged.

KaLM. We use KaLM-Embedding-Gemma3-12B (Zhao et al., [2025](https://arxiv.org/html/2605.31100#bib.bib48)), a 12B-parameter embedding model from Tencent built on the Gemma3 foundation. The model uses symmetric encoding for both queries and documents, with L2-normalized output embeddings.

Table[6](https://arxiv.org/html/2605.31100#A3.T6)summarizes the embedding dimensionality of the models used in our main experiments\.

Table 6:Embedding output dimensionality for the models used in our main experiments\.Abbr\. denotes the model shorthand used throughout the paper\.
#### C\.1\.3Baseline Methods

For all alignment baselines ($\mathsf{Linear}$, $\mathsf{CCA}$, $\mathsf{MLP}$, $\mathsf{Proc}$, $\mathsf{RCSLS}$), we align $\mathbf{E}_1$ with the $\mathbf{E}_2$ space and infer links as mutual nearest neighbors under CSLS-adjusted cosine similarity.
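The link-inference step shared by these baselines can be sketched as follows, using the standard CSLS definition $\mathrm{CSLS}(x,y)=2\cos(x,y)-r_Y(x)-r_X(y)$, where $r$ is the mean cosine similarity to the $k$ nearest neighbors in the other space. The function names and the default `k` are assumptions, and `X` is assumed to hold the already aligned $\mathbf{E}_1$ vectors.

```python
import numpy as np

def csls_matrix(X, Y, k=10):
    """CSLS scores between rows of X (aligned source) and rows of Y (target)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    S = Xn @ Yn.T                                   # cosine similarities
    kx, ky = min(k, S.shape[1]), min(k, S.shape[0])
    r_x = np.sort(S, axis=1)[:, -kx:].mean(axis=1)  # hubness term per source row
    r_y = np.sort(S, axis=0)[-ky:, :].mean(axis=0)  # hubness term per target row
    return 2.0 * S - r_x[:, None] - r_y[None, :]

def mutual_nearest_neighbors(score):
    """Pairs (i, j) where j is the best target for i and i is the best source for j."""
    best_y = score.argmax(axis=1)
    best_x = score.argmax(axis=0)
    return [(i, int(j)) for i, j in enumerate(best_y) if best_x[j] == i]
```

Subtracting the two neighborhood terms penalizes "hub" vectors that are close to many points, which is why CSLS is commonly paired with mutual nearest-neighbor matching.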

Linear transformation ($\mathsf{Linear}$). We perform alignment by learning a linear map from $\mathbf{E}_1$ to $\mathbf{E}_2$. Given seed links $\{(a_i,b_i)\}_{i\in\mathcal{S}}$, where $a_i\in\mathbb{R}^{d_s}$ and $b_i\in\mathbb{R}^{d_t}$ are embeddings of the same item in the source and target spaces, respectively, we learn a bias-free linear transformation $W\in\mathbb{R}^{d_t\times d_s}$ by minimizing the mean squared error $\mathcal{L}(W)=\frac{1}{|\mathcal{S}|}\sum_{i\in\mathcal{S}}\|Wa_i-b_i\|_2^2$. We optimize $W$ with Adam (learning rate $10^{-3}$) for 100 epochs.

Canonical Correlation Analysis ($\mathsf{CCA}$). We standardize each space independently and fit CCA on the seed pairs to learn one linear projection per space that maximizes correlation between projected seed embeddings.

Multi-Layer Perceptron ($\mathsf{MLP}$). We train a single-hidden-layer MLP mapping from the source embedding dimension to the target embedding dimension, with hidden width 512 and ReLU activations. We train on seed pairs using a cosine loss, optimizing with Adam (learning rate $10^{-2}$, weight decay $10^{-5}$) for 100 epochs.

Procrustes \(𝖯𝗋𝗈𝖼\\mathsf\{Proc\}\)\.We align the two embedding spaces by solving the orthogonal Procrustes problem on the seed links\. Let𝐗,𝐘∈ℝn×d\\mathbf\{X\},\\mathbf\{Y\}\\in\\mathbb\{R\}^\{n\\times d\}denote the corresponding seed embeddings \(rows are paired items\)\. We estimate an orthogonal map𝐑⋆=arg⁡min𝐑∈ℝd×d⁡‖𝐗𝐑−𝐘‖F2s\.t\.𝐑⊤​𝐑=𝐈\\mathbf\{R\}^\{\\star\}=\\arg\\min\_\{\\mathbf\{R\}\\in\\mathbb\{R\}^\{d\\times d\}\}\\ \\\|\\mathbf\{X\}\\mathbf\{R\}\-\\mathbf\{Y\}\\\|\_\{F\}^\{2\}\\quad\\text\{s\.t\.\}\\quad\\mathbf\{R\}^\{\\top\}\\mathbf\{R\}=\\mathbf\{I\}\. Let𝐀=𝐗⊤​𝐘\\mathbf\{A\}=\\mathbf\{X\}^\{\\top\}\\mathbf\{Y\}and compute its SVD𝐀=𝐔​𝚺​𝐕⊤\\mathbf\{A\}=\\mathbf\{U\}\\bm\{\\Sigma\}\\mathbf\{V\}^\{\\top\}\. A closed\-form optimum is given by𝐑⋆=𝐔𝐕⊤\\mathbf\{R\}^\{\\star\}=\\mathbf\{U\}\\mathbf\{V\}^\{\\top\}\. At inference time, we align embedding𝐱\\mathbf\{x\}via𝐱𝐑⋆\\mathbf\{x\}\\mathbf\{R\}^\{\\star\}\.
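The closed-form Procrustes solution takes only a few lines in NumPy. The sketch below assumes, as in the formula above, that both seed matrices share the same dimensionality $d$; the function name is hypothetical.

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal map R* minimizing ||X R - Y||_F, from the SVD of A = X^T Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt  # R* = U V^T

def apply_alignment(x, R):
    """Align a row vector (or a matrix of row vectors) via x R*."""
    return x @ R
```

When the target seeds are an exact rotation of the source seeds and $\mathbf{X}^{\top}\mathbf{X}$ has full rank, this recovers that rotation exactly (up to floating-point error).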

Relaxed Cross\-domain Similarity Local Scaling \(𝖱𝖢𝖲𝖫𝖲\\mathsf\{RCSLS\}\)We implement RCSLS\(Joulin et al\.,[2018](https://arxiv.org/html/2605.31100#bib.bib19)\), which directly optimizes a linear map to improve CSLS\-based retrieval and mitigate hubness\. We initialize with the Procrustes solution and optimize the RCSLS objective on the seed pairs using gradient\-based optimization, projecting back to the orthogonal group after each update\.

Unbalanced Gromov\-Wasserstein \(𝖴𝖦𝖶\\mathsf\{UGW\}\)\.We implement Unbalanced Gromov\-Wasserstein\(Séjourné et al\.,[2021](https://arxiv.org/html/2605.31100#bib.bib32)\)with POT’s log\-domain Sinkhorn solver on intra\-view cosine distance matrices \(normalized by their mean\), warm\-started by a seed\-biased coupling on the supervised pairs\. Links are read from the resulting transport plan via mutual argmax\.

Bootstrapping parallel anchors ($\mathsf{AO}$). We implement AO (Cannistraci et al., [2023](https://arxiv.org/html/2605.31100#bib.bib6)) to discover links via relative representations and Sinkhorn OT. We $\ell_2$-normalize embeddings and follow the original optimization schedule (250 steps; one Sinkhorn iteration per step). We set the anchor budget to the true overlap size, $K=\alpha|\mathcal{D}|$ (i.e., AO is given the overlap cardinality). All other baselines and $\mathsf{GEH}$ do not assume knowledge of $\alpha$.

#### C\.1\.4Computational Resources

All experiments were run on a Kubernetes cluster\. Each run was allocated a single compute node with an AMD EPYC 7713P \(64 cores\), 896 GB RAM, and one NVIDIA A100 GPU \(80 GB\), running Ubuntu 22\.04\.5 LTS\.

### C\.2Implementation Details and Extended Analysis of𝖦𝖤𝖧\\mathsf\{GEH\}

#### C\.2\.1Stopping Criterion

We monitor the *mutual-NN ratio*, defined as the fraction of points that participate in at least one mutual nearest-neighbor (MNN) pair in an iteration. Let $\mathcal{P}_t$ be the set of MNN pairs returned at iteration $t$, and let

$$U_t:=\{u\in\mathbf{E}_1:\exists v,(u,v)\in\mathcal{P}_t\}\ \cup\ \{v\in\mathbf{E}_2:\exists u,(u,v)\in\mathcal{P}_t\},$$

where $M_t:=|U_t|$ and $N:=\max\{|\mathbf{E}_1|,|\mathbf{E}_2|\}$. We define $\mathrm{MNN\_ratio}_t:=M_t/N$ and terminate bootstrapping if any of the following holds: (i) $M_t=0$ (no mutual pairs); (ii) after burn-in $T_{\min}$ (default $10$), the ratio stabilizes, i.e.,

$$\max_{i\in\{t-T_{\min}+1,\dots,t\}}\left|\mathrm{MNN\_ratio}_i-\mathrm{MNN\_ratio}_{i-1}\right|<0.01;$$

or (iii) a maximum of 100 iterations is reached.
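The three stopping conditions can be sketched as below. The function names and the bookkeeping (a list holding one ratio per completed iteration) are assumptions; in particular, condition (ii) is read here as requiring $T_{\min}+1$ recorded ratios so that all $T_{\min}$ consecutive differences exist.

```python
def mnn_ratio(pairs, n1, n2):
    """Return (M_t, MNN_ratio_t) for a list of MNN pairs (u, v), where u indexes
    E1 and v indexes E2; M_t counts the distinct points on both sides."""
    left = {u for u, _ in pairs}
    right = {v for _, v in pairs}
    m_t = len(left) + len(right)
    return m_t, m_t / max(n1, n2)

def should_stop(ratios, m_t, t_min=10, tol=0.01, max_iter=100):
    """ratios[i] is MNN_ratio after iteration i + 1; m_t is M_t of the latest one."""
    t = len(ratios)
    if m_t == 0:                 # (i) no mutual pairs
        return True
    if t >= max_iter:            # (iii) iteration budget exhausted
        return True
    if t > t_min:                # (ii) ratio stable over the last t_min steps
        recent = ratios[-(t_min + 1):]
        diffs = [abs(b - a) for a, b in zip(recent, recent[1:])]
        return max(diffs) < tol
    return False
```

A caller would append the new ratio after each iteration and break out of the bootstrapping loop as soon as `should_stop` returns `True`.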

#### C\.2\.2Additional Posterior/Precision–Proximity Analyses Across Datasets

We reportℒ1\\mathcal\{L\}\_\{1\}\(link set from the first iteration\) under 20% ground\-truth overlap with 15 seed pairs onMistralandOpenAIembeddings\. Predicted links are grouped into 30 quantile bins by their minimum cosine distance \(in the corresponding embedding space\) to the anchors that voted for them; we plot per\-bin precision and mean posterior\. As shown in Figure[9](https://arxiv.org/html/2605.31100#A3.F9), links supported by anchors at smaller distances are more accurate, consistent with our Theorem[2\.2](https://arxiv.org/html/2605.31100#S2.SS2), and the mean posterior closely tracks empirical precision across bins, indicating that the confidence score is well\-calibrated\.

![Refer to caption](https://arxiv.org/html/2605.31100v1/x25.png)aNFCorpus
![Refer to caption](https://arxiv.org/html/2605.31100v1/x26.png)bSciFact
![Refer to caption](https://arxiv.org/html/2605.31100v1/x27.png)cArguAna
![Refer to caption](https://arxiv.org/html/2605.31100v1/x28.png)dSciDocs

Figure 9:Posterior/Precision vs\. anchor proximity\.ForMistral↔\\leftrightarrowOpenAIlinking atα=0\.2\\alpha=0\.2overlap with\|𝒮\|=15\|\\mathcal\{S\}\|=15seeds, we bin predicted links by their minimum distance to the anchors that voted for them \(30 quantile bins\) and plot per\-bin empirical precision and mean posterior confidence\.
#### C\.2\.3Robustness to𝒮\\mathcal\{S\}Initialization

We analyze the sensitivity of𝖦𝖤𝖧\\mathsf\{GEH\}to the structural properties of the initial supervision set𝒮\\mathcal\{S\}\. Since𝖦𝖤𝖧\\mathsf\{GEH\}relies on a small set of anchors sampled from the overlap region between two embedding sets, we test whether different strategies for selecting these seed pairs materially affect final performance\.

Setup\.We evaluate betweenQwenandOpenAIembeddings across five datasets\. We hold the overlap ratioα\\alphaand number of seed anchors\|𝒮\|\|\\mathcal\{S\}\|constant while varying the sampling strategy used to select\|𝒮\|\|\\mathcal\{S\}\|from the overlap\.

We compare four different strategies:

- Nearest: Randomly choose one anchor and take its $k-1$ nearest neighbors (in Qwen space), where $k=|\mathcal{S}|$. This produces a localized supervision pattern.
- Random: Seeds are sampled uniformly without replacement. This is the default strategy employed in our main experiments, requiring no prior knowledge of the overlap manifold.
- FPS: Greedily build a seed set by repeatedly selecting the candidate that maximizes its minimum cosine distance to previously selected seeds. This yields a highly diverse seed set with “spread-out” supervision.
- Centroids: We cluster the overlapping vectors (in Qwen space) using $k$-means (where $k=|\mathcal{S}|$) and select the vectors nearest to the centroids. This ensures seed anchors are representative of the distribution.

Table[7](https://arxiv.org/html/2605.31100#A3.T7)reports F1 across all 45 \(dataset,α\\alpha,\|𝒮\|\|\\mathcal\{S\}\|\) configurations\. Performance is stable across strategies \(the largest max−\-min spread is 3\.0 pp\), confirming that𝖦𝖤𝖧\\mathsf\{GEH\}is robust to the choice of seed anchors\.

Table 7:F1 \(%\\%\) by seed\-initialization strategyonQwen↔\\leftrightarrowOpenAIlinking across five datasets, three overlap ratiosα∈\{0\.15,0\.20,0\.30\}\\alpha\\in\\\{0\.15,0\.20,0\.30\\\}, and three seed budgets\|𝒮\|∈\{15,20,30\}\|\\mathcal\{S\}\|\\in\\\{15,20,30\\\}\. Bold marks the best strategy within each row\. TheMax Gapcolumn reports the spread \(max−\-min\) across strategies in percentage points \(pp\); the bolded value is the largest spread observed across all 45 setups\.
#### C\.2\.4Sensitivity tocc,ρ0\\rho\_\{0\}, andkCSLSk\_\{\\mathrm\{CSLS\}\}

We evaluate robustness onSciFactandNFCorpususingMistral↔\\leftrightarrowOpenAIembeddings, fixing the overlap ratio toα=0\.3\\alpha=0\.3and the seed budget to\|𝒮\|=15\|\\mathcal\{S\}\|=15\. We sweep three hyperparameters: the view\-growth constantccinsf​\(g\)=1\+c​log⁡g\\mathrm\{sf\}\(g\)=1\+c\\log g, the base per\-view anchor fractionρ0\\rho\_\{0\}, and the CSLS neighborhood sizekCSLSk\_\{\\mathrm\{CSLS\}\}used for MNN retrieval\. When varyingρ0\\rho\_\{0\}, we setm0=⌈2/ρ0⌉m\_\{0\}=\\lceil 2/\\rho\_\{0\}\\rceilto keep the expected per\-anchor coverage approximately constant \(m0​ρ0≈2m\_\{0\}\\rho\_\{0\}\\approx 2, up to rounding\)\. For each sweep, we report the mean F1 across the two datasets and define the stable range as configurations achieving at least0\.97×0\.97\\timesthe best mean F1 in that sweep\. As shown in Figure[10](https://arxiv.org/html/2605.31100#A3.F10), performance is stable over broad intervals:c∈\[0\.2,1\]c\\in\[0\.2,1\],kCSLS∈\[4,50\]k\_\{\\mathrm\{CSLS\}\}\\in\[4,50\], andρ0∈\[0\.2,0\.5\]\\rho\_\{0\}\\in\[0\.2,0\.5\]\. We usec=0\.3c=0\.3,ρ0=0\.4\\rho\_\{0\}=0\.4, andkCSLS=50k\_\{\\mathrm\{CSLS\}\}=50in all experiments\.
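The two bookkeeping rules used in this sweep, the coupling $m_0=\lceil 2/\rho_0\rceil$ and the 97% stable-range criterion, can be written out as small helpers. The function names and the `coverage` and `frac` parameters are hypothetical; no values from the paper's sweeps are reproduced here.

```python
import math

def base_views(rho0, coverage=2.0):
    """m_0 = ceil(coverage / rho_0), keeping m_0 * rho_0 close to `coverage`."""
    return math.ceil(coverage / rho0)

def stable_range(values, mean_f1, frac=0.97):
    """Hyperparameter values whose mean F1 reaches at least frac * best mean F1."""
    best = max(mean_f1)
    return [v for v, f in zip(values, mean_f1) if f >= frac * best]
```

For the default $\rho_0=0.4$ this gives `base_views(0.4) == 5`, so each anchor is expected to appear in roughly two views at the first iteration.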

![Refer to caption](https://arxiv.org/html/2605.31100v1/x29.png)

Figure 10:Sensitivity to view scheduling and CSLS hyperparameters\.F1 onSciFactandNFCorpusforMistral↔\\leftrightarrowOpenAIlinking with overlap ratioα=0\.3\\alpha=0\.3and\|𝒮\|=15\|\\mathcal\{S\}\|=15seeds\. We vary \(left\) the logarithmic growth constantccinsf​\(g\)=1\+c​log⁡g\\mathrm\{sf\}\(g\)=1\+c\\log g, \(middle\) the CSLS neighborhood sizekCSLSk\_\{\\mathrm\{CSLS\}\}, and \(right\) the base per\-view anchor fractionρ0\\rho\_\{0\}\. The shaded gray region denotes the near\-optimal range achieving at least97%97\\%of the peak F1 for each sweep\.

### C\.3Additional Results of𝖦𝖤𝖧\\mathsf\{GEH\}

#### C\.3\.1Additional Results on Vector Linking

We report the complete experimental grid over55model pairs and55datasets, across33overlap ratios and33seed budgets\. Across this grid,𝖦𝖤𝖧\\mathsf\{GEH\}achieves the best performance among all methods in the vast majority of settings\.

Tables 8–31: Vector linking results per dataset and model pair. Each cell reports precision/recall/F1 (%). Best values per metric are bolded.

- GTE↔Mistral: Table 8 (NFCorpus), Table 9 (SciDocs), Table 10 (ArguAna), Table 11 (SciFact), Table 12 (FiQA)
- GTE↔OpenAI: Table 13 (NFCorpus), Table 14 (SciDocs), Table 15 (ArguAna), Table 16 (SciFact), Table 17 (FiQA)
- Mistral↔OpenAI: Table 18 (NFCorpus), Table 19 (SciDocs), Table 20 (ArguAna), Table 21 (FiQA)
- Qwen↔KaLM: Table 22 (NFCorpus), Table 23 (SciDocs), Table 24 (ArguAna), Table 25 (SciFact), Table 26 (FiQA)
- Qwen↔OpenAI: Table 27 (NFCorpus), Table 28 (SciDocs), Table 29 (ArguAna), Table 30 (SciFact), Table 31 (FiQA)

#### C\.3\.2Additional Results on Out\-of\-Domain Anchors

Figure [11](https://arxiv.org/html/2605.31100#A3.F11) extends the main\-text OOD analysis \(Fig\. [5](https://arxiv.org/html/2605.31100#S5.F5)\) by sweeping the seed budget $n \in \{15, 20, 30\}$ and the target overlap ratio $\alpha \in \{0.15, 0.2, 0.3\}$\. Overall, most reference–target pairs preserve strong precision and recall under OOD seeding; the few degraded cases align with our Theorem [2\.2](https://arxiv.org/html/2605.31100#S2.SS2), which predicts that links supported primarily by long\-range anchors are less reliable\.

![Panel (a)](https://arxiv.org/html/2605.31100v1/x30.png) \(a\) $n=15$, $o=0.15$
![Panel (b)](https://arxiv.org/html/2605.31100v1/x31.png) \(b\) $n=15$, $o=0.2$
![Panel (c)](https://arxiv.org/html/2605.31100v1/x32.png) \(c\) $n=15$, $o=0.3$
![Panel (d)](https://arxiv.org/html/2605.31100v1/x33.png) \(d\) $n=20$, $o=0.15$
![Panel (e)](https://arxiv.org/html/2605.31100v1/x34.png) \(e\) $n=20$, $o=0.2$
![Panel (f)](https://arxiv.org/html/2605.31100v1/x35.png) \(f\) $n=20$, $o=0.3$
![Panel (g)](https://arxiv.org/html/2605.31100v1/x36.png) \(g\) $n=30$, $o=0.15$
![Panel (h)](https://arxiv.org/html/2605.31100v1/x37.png) \(h\) $n=30$, $o=0.2$

Figure 11: Out\-of\-domain reference transfer \(additional settings\): Accuracy \(left\) and recall \(right\) on five target datasets \(columns\) when seeds are drawn from an out\-of\-domain reference dataset \(rows\)\. Each panel varies the number of seeds $n$ and target overlap $o$\. The main text reports the case $n=30$, $o=0.3$ \(Fig\. [5](https://arxiv.org/html/2605.31100#S5.F5)\)\.

## Appendix D Details of Section [6](https://arxiv.org/html/2605.31100#S6)

### D\.1 Implementation Details of Applications

#### D\.1\.1 Vector Database Integration

We follow the evaluation protocol of Yang et al\. \([2025](https://arxiv.org/html/2605.31100#bib.bib44)\): for each benchmark corpus $O$, we place all benchmark answer documents only in $O_1 \cup O_2$ \(i\.e\., $O_\cap$ contains no answer documents\)\. We then build two vector databases $D_1 = \mathrm{emb}_1(O_1 \cup O_\cap)$ and $D_2 = \mathrm{emb}_2(O_2 \cup O_\cap)$, and evaluate retrieval using queries encoded by $\mathrm{emb}_2$, i\.e\., we query the integrated database with $\mathrm{emb}_2(q)$\.

Given vector links $(X, Y)$ induced by $O_\cap$ \(vectors in $D_1$ and $D_2$ that encode the same items\), we compute an integration mapping $T$ from the vector space of $D_1$ to that of $D_2$ using the local\-isometry\-based framework of Yang et al\. \([2025](https://arxiv.org/html/2605.31100#bib.bib44)\), and return the integrated database $T(D_1) \cup D_2$\. Let $Q$ be the benchmark query set, and for each query $q \in Q$ let $\mathrm{ans}_q$ denote its ground\-truth relevant set\. Let $\mathrm{top}\text{-}k(q)$ be the top\-$k$ results returned by searching the integrated database with $\mathrm{emb}_2(q)$\. We report:

$$\mathrm{Recall@}k \;=\; \frac{1}{|Q|} \sum_{q \in Q} \frac{|\mathrm{ans}_q \cap \mathrm{top}\text{-}k(q)|}{|\mathrm{ans}_q|}.$$

For rank\-sensitive evaluation, we also report NDCG:

$$\mathrm{NDCG@}k \;=\; \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}, \qquad \mathrm{DCG@}k \;=\; \sum_{i=1}^{k} \frac{\mathrm{rel}_i}{\log_2(i+1)},$$

where $\mathrm{rel}_i$ is the graded relevance of the item at rank $i$ and $\mathrm{IDCG@}k$ is the DCG of the ideal ranking\.
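The two metrics above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' evaluation code: the function names, the list-of-ids ranking format, and the `relevance` dictionary are assumptions made for the example.

```python
import math

def recall_at_k(ranked, answers, k):
    """|ans_q ∩ top-k(q)| / |ans_q| for a single query."""
    if not answers:
        return 0.0
    return len(set(ranked[:k]) & set(answers)) / len(answers)

def dcg_at_k(gains, k):
    """DCG@k = sum over ranks i = 1..k of rel_i / log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))

def ndcg_at_k(ranked, relevance, k):
    """relevance maps doc id -> graded relevance; unlisted ids count as 0."""
    gains = [relevance.get(doc, 0.0) for doc in ranked[:k]]
    ideal = sorted(relevance.values(), reverse=True)  # ideal ranking
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

def mean_over_queries(scores):
    """Average a per-query metric over the query set Q."""
    return sum(scores) / len(scores)
```

Per-query values are averaged over $Q$ with `mean_over_queries`, matching the $\frac{1}{|Q|}\sum_{q \in Q}$ in the Recall@$k$ definition.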

We use a FAISS GPU index with inner\-product search over $\ell_2$\-normalized embeddings \(equivalent to cosine similarity\)\.
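To make the equivalence concrete, the sketch below checks that the inner product of $\ell_2$-normalized vectors equals their cosine similarity, and runs a brute-force top-$k$ search. It is a pure-Python stand-in for illustration only; it does not use FAISS, and all function names are hypothetical.

```python
import math

def inner(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale v to unit Euclidean norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(inner(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(u, v):
    """Cosine similarity computed directly from unnormalized vectors."""
    return inner(u, v) / (math.sqrt(inner(u, u)) * math.sqrt(inner(v, v)))

def top_k_inner_product(query, database, k):
    """Brute-force top-k by inner product over L2-normalized vectors."""
    q = l2_normalize(query)
    scored = [(inner(q, l2_normalize(d)), i) for i, d in enumerate(database)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by index
    return [i for _, i in scored[:k]]
```

Because normalization removes vector length, the ranking returned by `top_k_inner_product` is the cosine-similarity ranking.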

#### D\.1\.2 Global Cross\-Model Clustering

We evaluate cross\-model clustering using two clustering benchmarks from MTEB \(Enevoldsen et al\., [2025](https://arxiv.org/html/2605.31100#bib.bib10)\)\. Both datasets consist of short *titles* and provide gold cluster labels \(e\.g\., subreddit or StackExchange community\)\. Dataset statistics are summarized in Table [32](https://arxiv.org/html/2605.31100#A4.T32)\.

Table 32: Cross\-model clustering datasets from MTEB \(test split metadata\)\. Lengths are measured in characters per title\.

For a dataset with texts $\{t_\ell\}_{\ell=1}^{N}$ and gold labels $\{y_\ell\}$, we generate two embedding sets $\mathbf{E}_1 = \{e^{(1)}_\ell\}$ and $\mathbf{E}_2 = \{e^{(2)}_\ell\}$ using two embedding models \(e\.g\., Qwen and KaLM\)\. We then create a partial\-overlap partition by selecting index sets $\mathcal{I}_1, \mathcal{I}_2 \subseteq [N]$ such that $\mathcal{I}_\cap = \mathcal{I}_1 \cap \mathcal{I}_2$ contains the shared items and $\mathcal{I}_1 \setminus \mathcal{I}_2$, $\mathcal{I}_2 \setminus \mathcal{I}_1$ are model\-specific items\. The ground\-truth correspondence set is $\mathcal{P}^\star = \{(\ell, \ell) : \ell \in \mathcal{I}_\cap\}$; predicted correspondences $\hat{\mathcal{P}}$ are produced by a linking method\.
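One possible way to build such a partition is sketched below. The section does not say how the overlap ratio is normalized, so this example assumes that a fraction `alpha` of all $N$ items is shared and that the remaining items are split evenly between the two sides; the function name and both assumptions are illustrative only.

```python
import random

def make_partial_overlap(n_items, alpha, seed=0):
    """Return (I1, I2, P_star) for a hypothetical partial-overlap split.

    Assumption: round(alpha * n_items) items are shared, and the remaining
    items are divided evenly into model-specific sets.
    """
    rng = random.Random(seed)
    idx = list(range(n_items))
    rng.shuffle(idx)

    n_shared = round(alpha * n_items)
    shared, rest = idx[:n_shared], idx[n_shared:]
    half = len(rest) // 2

    i1 = set(shared) | set(rest[:half])   # items embedded by model 1
    i2 = set(shared) | set(rest[half:])   # items embedded by model 2
    p_star = {(l, l) for l in i1 & i2}    # ground-truth correspondences
    return i1, i2, p_star
```

A linking method would then see only the embeddings indexed by `i1` and `i2`, and its predicted pairs can be scored against `p_star`.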

Graph construction\. For each embedding space, we build a $k$\-NN graph $G_1$ on $\{e^{(1)}_\ell : \ell \in \mathcal{I}_1\}$ and $G_2$ on $\{e^{(2)}_\ell : \ell \in \mathcal{I}_2\}$ \(cosine similarity\)\. We choose $k$ adaptively to ensure connectivity, by increasing $k$ until the largest connected component covers at least $95\%$ of nodes\. We then form a unified graph $G$ by merging each correspondence pair $(\ell_1, \ell_2) \in \hat{\mathcal{P}}$ into a single super\-node that inherits the incident edges from both $G_1$ and $G_2$\. When no correspondences are provided, $G$ is simply the disjoint union of $G_1$ and $G_2$\.
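The construction can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: similarities are given as a dict-of-dicts `sim[u][v]`, the $k$-NN graph is symmetrized, $k$ grows by one per step, and the predicted links are one-to-one. All function names are hypothetical.

```python
from collections import defaultdict

def largest_component_fraction(adj):
    """Share of nodes in the largest connected component (undirected graph)."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best / len(adj) if adj else 0.0

def knn_graph(sim, k):
    """Symmetrized k-NN graph from pairwise similarities sim[u][v]."""
    adj = {u: set() for u in sim}
    for u, row in sim.items():
        nearest = sorted((v for v in row if v != u), key=lambda v: -row[v])[:k]
        for v in nearest:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def adaptive_knn_graph(sim, coverage=0.95, k=1):
    """Increase k until the largest component covers `coverage` of the nodes."""
    max_k = max(len(sim) - 1, 1)
    while True:
        adj = knn_graph(sim, k)
        if largest_component_fraction(adj) >= coverage or k >= max_k:
            return adj, k
        k += 1

def merge_graphs(adj1, adj2, links):
    """Unified graph: each linked pair (a, b) becomes a single super-node."""
    name = {}
    for a, b in links:
        name[("1", a)] = name[("2", b)] = ("link", a, b)

    merged = defaultdict(set)
    for space, adj in (("1", adj1), ("2", adj2)):
        for u, nbrs in adj.items():
            ru = name.get((space, u), (space, u))
            _ = merged[ru]  # keep isolated nodes in the unified graph
            for v in nbrs:
                rv = name.get((space, v), (space, v))
                if rv != ru:
                    merged[ru].add(rv)
                    merged[rv].add(ru)
    return dict(merged)
```

With an empty `links` list, `merge_graphs` returns the disjoint union of the two graphs, which matches the no-correspondence case described above.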

Clustering\. We apply Leiden community detection on $G$\. To make comparisons fair across methods, we use the same graph\-based clustering pipeline throughout and tune the Leiden resolution by binary search to match the known number of gold clusters in the evaluated set\.
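The resolution search can be illustrated with a generic binary search. The sketch below is an assumption-laden illustration: `cluster_fn` is a hypothetical callback (for example, a wrapper around a Leiden implementation) that returns one community label per node, the search bounds are arbitrary, and the number of communities is assumed to be non-decreasing in the resolution. Real Leiden runs are stochastic, so a fixed random seed would be needed for this assumption to hold even approximately.

```python
def tune_resolution(cluster_fn, target, lo=1e-3, hi=10.0, max_iter=40):
    """Binary-search a resolution whose clustering has `target` communities.

    cluster_fn(resolution) -> list of community labels (hypothetical callback).
    Returns (resolution, n_communities) for the closest match found.
    """
    best = None
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        n = len(set(cluster_fn(mid)))
        # Remember the resolution that came closest to the target so far.
        if best is None or abs(n - target) < abs(best[1] - target):
            best = (mid, n)
        if n == target:
            break
        if n < target:
            lo = mid   # too few communities: raise the resolution
        else:
            hi = mid   # too many communities: lower the resolution
    return best
```

If no resolution yields exactly `target` communities within `max_iter` steps, the closest setting seen is returned rather than raising an error.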

We report: \(i\) Full\-space, single\-model clustering on each complete embedding space independently \(Qwen and KaLM\), serving as optimal references; \(ii\) Concat, which zero\-pads embeddings to a common dimension and concatenates them without using any cross\-space correspondences; \(iii\) Seed, which stitches the two $k$\-NN graphs by node\-merging using only ground\-truth seed correspondences; and \(iv\) Ours, which performs the same node\-merging procedure using predicted correspondences $\hat{\mathcal{P}}$ from $\mathsf{GEH}$\.

### D\.2 Experimental Results on Cross Model Clustering

We report cross\-model clustering results for Qwen↔KaLM at overlap ratios $\alpha \in \{0.2, 0.3\}$ and seed budgets $n \in \{20, 30\}$\. We evaluate clustering quality using V\-measure, NMI, and ARI, and additionally report *Overlap Agreement Rate* \(OAR\), defined as the fraction of overlapped items whose two embeddings \(one from each space\) are assigned to the same community in the unified clustering\. OAR is not reported for naive concatenation since it produces a disjoint union of the two graphs and does not induce cross\-space communities\. As shown in Table [33](https://arxiv.org/html/2605.31100#A4.T33), using only seed correspondences yields limited cross\-space connectivity and suboptimal global coherence, whereas using $\mathsf{GEH}$ to stitch the graphs achieves high cross\-space coupling \(OAR $= 75$–$98\%$\) and recovers cluster quality within $\approx 1\%$ of single\-space performance\.
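The OAR definition above translates directly into code. In this sketch the two community assignments are plain dictionaries keyed by item index, which is an assumed data layout rather than the authors' format; the function name is hypothetical.

```python
def overlap_agreement_rate(comm1, comm2, overlap):
    """Fraction of overlapped items whose two embeddings share a community.

    comm1[l] / comm2[l]: community assigned to item l's embedding from
    space 1 / space 2 in the unified clustering (assumed layout).
    """
    overlap = list(overlap)
    if not overlap:
        return 0.0
    agree = sum(1 for l in overlap if comm1[l] == comm2[l])
    return agree / len(overlap)
```

Items whose two embeddings were merged into one super-node agree by construction, so under this reading OAR mainly reflects how overlapped items that were not linked end up clustered.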

Table 33: Cross\-model clustering performance for Qwen↔KaLM embeddings\. Each cell reports V\-measure / NMI / ARI \(%\)\. OAR = Overlap Agreement Rate \(%\)\. Bold indicates best per metric among Concat/Seed/OURS for each configuration\.
