# jina-embeddings-v5-omni: Text-Geometry-Preserving Multimodal Embeddings via Frozen-Tower Composition
Source: [https://arxiv.org/html/2605.08384](https://arxiv.org/html/2605.08384)
Michael Günther, Andreas Koukounas, Kalim Akram, Scott Martens, Saba Sturua, and Han Xiao
###### Abstract
In this work, we introduce *frozen-encoder model composition*, a novel approach to multimodal embedding models. We build on the VLM-style architecture, in which non-text encoders are adapted to produce input for a language model, which in turn generates embeddings for all varieties of input. We present the result: the jina-embeddings-v5-omni suite, a pair of models that encode text, image, audio, and video input into a single semantic embedding space. Our method extends the two Jina Embeddings v5 Text models to support additional media by adding encoders for images and audio. The backbone text embedding models and the added non-text media encoders remain frozen; we train only the connecting components, representing 0.35% of the total weights of the joint model. Training is therefore much more efficient than full-parameter retraining. Additionally, the language model remains effectively unaltered, producing exactly the same embeddings for text inputs as the Jina Embeddings v5 Text models. Our evaluations show that this approach produces results that are competitive with the state of the art, yielding nearly equal performance to larger multimodal embedding models.
All authors are with Jina by Elastic. Contact: research@jina.ai
CCS Concepts: Information systems → Multimedia and multimodal retrieval; Computing methodologies → Image representations; Computing methodologies → Machine learning

## 1. Introduction
Text embedding models anchor retrieval, retrieval-augmented generation (RAG) (Lewis et al., [2020](https://arxiv.org/html/2605.08384#bib.bib19)), and classification pipelines whose vector indexes depend on a stable embedding geometry. At the same time, search workloads increasingly require images (including screenshots, page scans, infographics, and other rendered media), audio (such as speech, music, and natural sounds), and video to be queried alongside text (Xiao et al., [2025b](https://arxiv.org/html/2605.08384#bib.bib37); Macé et al., [2025](https://arxiv.org/html/2605.08384#bib.bib25); Jiang et al., [2025](https://arxiv.org/html/2605.08384#bib.bib14); El Assadi et al., [2026](https://arxiv.org/html/2605.08384#bib.bib8)).
Figure 1. Average performance across multimodal embedding tasks versus model parameter count (see Table [1](https://arxiv.org/html/2605.08384#S5.T1)). A one-column frontier chart with Table 1 average scores for six open-weight omni models: jina-v5-omni-nano, jina-v5-omni-small, LanguageBind, LCO-3B, LCO-7B, and Nem-3B.

Figure 2. Architecture of jina-embeddings-v5-omni (jina-embeddings-v5-omni-small shown; jina-embeddings-v5-omni-nano uses a smaller ViT and LLaVA-style tokens). Frozen towers feed trainable modality projectors into the frozen text backbone; task-specific exports select one projector/delimiter set and the matching LoRA adapter.

We present jina-embeddings-v5-omni, a pair of models that extends a text embedding backbone to image, video, and audio while leaving the model entirely unchanged for text inputs. The two models differ substantially in size: jina-embeddings-v5-omni-nano is based on jina-embeddings-v5-text-nano, with 0.24B parameters in its base text-only model, and jina-embeddings-v5-omni-small is based on jina-embeddings-v5-text-small, with 0.67B parameters (Akram et al., [2026](https://arxiv.org/html/2605.08384#bib.bib2)). The two base models have already been trained for high-performance text embeddings, using LoRA adapters to optimize them for multiple tasks: retrieval, text-matching, clustering, and classification.
To add support for non-text modalities, we integrate:
- Vision encoders from Qwen3.5-2B and Qwen3.5-0.8B (Qwen Team, [2026](https://arxiv.org/html/2605.08384#bib.bib27)), which have been adapted from SigLIP2 So400m and SigLIP2 Base respectively (Tschannen et al., [2025](https://arxiv.org/html/2605.08384#bib.bib33)).
- The Qwen2.5-Omni audio encoder (Chu et al., [2025](https://arxiv.org/html/2605.08384#bib.bib7)), which has been adapted from Whisper-large-v3 (Radford et al., [2023](https://arxiv.org/html/2605.08384#bib.bib29)).
The core idea of *frozen-encoder model composition* is to use independently pretrained, language-aligned encoders and align them to text embedding models through small trainable projectors rather than jointly retraining them. This makes it possible to readily construct modular multimodal embedding models while minimizing added parameters and additional training.
##### Contributions
1. We describe *frozen-encoder model composition* and apply it in the construction of the jina-embeddings-v5-omni model suite by extending the Jina Embeddings v5 Text suite to support other media.
2. We contribute to the open embedding ecosystem by releasing the jina-embeddings-v5-omni model collection ([Jina Embeddings v5 Omni Hugging Face collection](https://huggingface.co/collections/jinaai/jina-embeddings-v5-omni-69f336b985c156b1d757029e)), comprising two base models and eight task-specific variants for retrieval, classification, clustering, and text-matching across Small and Nano scales.
3. We evaluate jina-embeddings-v5-omni and comparable models across a range of standard benchmarks, and show that our approach produces competitive results (see Figure [1](https://arxiv.org/html/2605.08384#S1.F1)).
4. We analyze the design rules behind the recipe through ablations on projector training, encoder choice, and Matryoshka truncation, and separately quantify training efficiency.
## 2. Related Work
Text-only embedding models are long established for retrieval and RAG systems, from bidirectional encoders such as Sentence-BERT (Reimers and Gurevych, [2019](https://arxiv.org/html/2605.08384#bib.bib30)) and GTE-Qwen2 (Alibaba Tongyi Lab, [2024](https://arxiv.org/html/2605.08384#bib.bib3)) to LLM-based text-only embedding models such as E5-Mistral (Wang et al., [2024b](https://arxiv.org/html/2605.08384#bib.bib34)) and NV-Embed (Lee et al., [2025](https://arxiv.org/html/2605.08384#bib.bib18)). Jina Embeddings v5 Text (Akram et al., [2026](https://arxiv.org/html/2605.08384#bib.bib2)) draws on this tradition: a state-of-the-art model family with task-conditioned LoRA adapters and support for truncation with low performance loss due to Matryoshka representation learning (Kusupati et al., [2022](https://arxiv.org/html/2605.08384#bib.bib17)).
CLIP (Radford et al., [2021](https://arxiv.org/html/2605.08384#bib.bib28)) established contrastive image–text embedding with separately encoded image and text towers, and SigLIP (Zhai et al., [2023](https://arxiv.org/html/2605.08384#bib.bib39)), SigLIP2 (Tschannen et al., [2025](https://arxiv.org/html/2605.08384#bib.bib33)), and EVA-CLIP (Fang et al., [2023](https://arxiv.org/html/2605.08384#bib.bib11)) refine this paradigm through improved losses, data, and visual training recipes. ImageBind (Girdhar et al., [2023](https://arxiv.org/html/2605.08384#bib.bib12)) extends contrastive alignment to additional modalities. Jina CLIP v1/v2 (Koukounas et al., [2024b](https://arxiv.org/html/2605.08384#bib.bib16), [a](https://arxiv.org/html/2605.08384#bib.bib15)) maintains text-embedding performance in CLIP-style models while supporting other media. However, contrastively-trained multimodal embedders suffer from a gap between modality-specific regions of the shared representation space (Liang et al., [2022](https://arxiv.org/html/2605.08384#bib.bib22)).
VLM-style architectures tackle this challenge by passing the outputs of non-text media encoders through the same language model as the text token representations. These models, including LLaVA (Liu et al., [2023](https://arxiv.org/html/2605.08384#bib.bib23)), BLIP-2 (Li et al., [2023](https://arxiv.org/html/2605.08384#bib.bib20)), Qwen2-VL (Wang et al., [2024a](https://arxiv.org/html/2605.08384#bib.bib35)), and Qwen3-VL (Bai et al., [2025](https://arxiv.org/html/2605.08384#bib.bib4)), use projectors or connector modules to connect the encoders to the language model. Embedding models derived from VLMs, like E5-V (Jiang et al., [2024](https://arxiv.org/html/2605.08384#bib.bib13)), GME (Zhang et al., [2025](https://arxiv.org/html/2605.08384#bib.bib41)), and Qwen3-VL-Embedding (Li et al., [2026](https://arxiv.org/html/2605.08384#bib.bib21)), demonstrate strong multimodal retrieval performance, but involve adapting the language model, the non-text media encoders, or both.
Omni-style systems train or align multiple modalities jointly, supporting video and audio in addition to images; examples include E5-Omni (Chen et al., [2026](https://arxiv.org/html/2605.08384#bib.bib5)), WAVE (Tang et al., [2026](https://arxiv.org/html/2605.08384#bib.bib32)), and LCO-Embedding-Omni (Xiao et al., [2025a](https://arxiv.org/html/2605.08384#bib.bib36)).
We take note of previous work in frozen-tower methods based on the CLIP architecture, such as LiT (Zhai et al., [2022](https://arxiv.org/html/2605.08384#bib.bib40)) and Nomic Embed Vision (Nussbaum et al., [2024](https://arxiv.org/html/2605.08384#bib.bib26)), which freeze the text encoder while adapting the other media towers. To the best of our knowledge, there is no previously published work extending frozen text embedding models to support non-text media using a VLM-style architecture.
## 3. Architecture
Figure [2](https://arxiv.org/html/2605.08384#S1.F2) summarizes the architecture of the jina-embeddings-v5-omni models. We extend Jina Embeddings v5 Text from text-only embedding to vision and audio by adding scale-matched Qwen3.5 vision encoders (jina-embeddings-v5-omni-small uses Qwen/Qwen3.5-2B; jina-embeddings-v5-omni-nano uses Qwen/Qwen3.5-0.8B) and the Qwen2.5-Omni audio encoder to the same text-sequence backbone. We chose encoders from trained multimodal language systems rather than bare perceptual encoders such as SigLIP2 or Whisper-large because prior work shows that visual and audio features need explicit language-space alignment or natural-language supervision before they transfer reliably to text-conditioned multimodal tasks (Chen et al., [2025](https://arxiv.org/html/2605.08384#bib.bib6); Elizalde et al., [2023](https://arxiv.org/html/2605.08384#bib.bib9); Qwen Team, [2026](https://arxiv.org/html/2605.08384#bib.bib27); Chu et al., [2025](https://arxiv.org/html/2605.08384#bib.bib7)). The text processing path of jina-embeddings-v5-omni is identical to Jina Embeddings v5 Text: token embeddings pass through the frozen text transformer, the inherited task LoRA adapter is applied, and the final embedding is produced by last-token pooling and L2 normalization.
### 3.1. Projectors
jina-embeddings-v5-omni uses image and audio encoders extracted from Qwen3.5 and Qwen2.5-Omni, respectively. These encoders do not produce output that matches the dimensionality of Jina Embeddings v5 Text's input, so we attach projectors that map their outputs to Jina Embeddings v5 Text's input specifications. For audio, we inserted a randomly-initialized fc_audio layer that projects the encoder's native 1280-dimension output into jina-embeddings-v5-omni-small's 1024-dimension input space and jina-embeddings-v5-omni-nano's 768-dimension one.
We write each fully connected layer as the same affine map

$$\ell_{W,\mathbf{b}}(\mathbf{x}) = W\mathbf{x} + \mathbf{b},$$

with layer-specific weights and bias. Thus fc_vision_1 is $\ell_{W_{\text{v1}},\mathbf{b}_{\text{v1}}}$, fc_vision_2 is $\ell_{W_{\text{v2}},\mathbf{b}_{\text{v2}}}$, and fc_audio is $\ell_{W_{\text{aud}},\mathbf{b}_{\text{aud}}}$.
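As a minimal concrete illustration of this shared form (the shapes below are arbitrary, not the model's real dimensions), each projector layer is one affine map:

```python
import numpy as np

def affine(W: np.ndarray, b: np.ndarray, x: np.ndarray) -> np.ndarray:
    """The shared form ell_{W,b}(x) = W x + b; each projector layer is one
    instance with its own (W, b)."""
    return W @ x + b

# Toy (d_out, d_in) = (2, 3) example.
W = np.arange(6, dtype=float).reshape(2, 3)
b = np.ones(2)
x = np.array([1.0, 0.0, -1.0])
y = affine(W, b, x)
assert np.allclose(y, [-1.0, -1.0])
```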
For vision, the Qwen3.5 visual projector converts ViT patch tokens into text-token features by applying LayerNorm, a $2\times 2$ spatial merge, fc_vision_1, GELU, and fc_vision_2. Here, LayerNorm denotes feature normalization on the ViT patch tokens. The $2\times 2$ spatial merge is a fixed space-to-depth (pixel-unshuffle) rearrangement that concatenates four neighboring patch embeddings into one $4d_{\text{vit}}$-dimensional vector, reducing the spatial token count by $4\times$; it is the inverse direction of pixel shuffle/sub-pixel rearrangement (Shi et al., [2016](https://arxiv.org/html/2605.08384#bib.bib31)) and follows Qwen's visual-merger design (Wang et al., [2024a](https://arxiv.org/html/2605.08384#bib.bib35); Qwen Team, [2026](https://arxiv.org/html/2605.08384#bib.bib27)). For each group of four neighboring patch tokens $\mathbf{V}_i = [\mathbf{v}_{i,1}, \ldots, \mathbf{v}_{i,4}] \in \mathbb{R}^{4\times d_{\text{vit}}}$, the vision projector produces
$$\begin{aligned}
\mathbf{m}^{(i)}_{\text{vis}} &= \bigl[\text{LayerNorm}(\mathbf{v}_{i,1}); \ldots; \text{LayerNorm}(\mathbf{v}_{i,4})\bigr] \in \mathbb{R}^{4d_{\text{vit}}},\\
\mathbf{z}^{(i)}_{\text{vis}} &= \text{GELU}\!\left(\ell_{W_{\text{v1}},\mathbf{b}_{\text{v1}}}(\mathbf{m}^{(i)}_{\text{vis}})\right),\\
\mathbf{h}^{(i)}_{\text{vis}} &= \ell_{W_{\text{v2}},\mathbf{b}_{\text{v2}}}(\mathbf{z}^{(i)}_{\text{vis}}), \qquad i = 1, \ldots, N_{\text{vis}}.
\end{aligned}$$
Only fc_vision_2 performs the dimension-specific projection into a text hidden space: in the 2B source checkpoint it maps $4096 \to 2048$ into the Qwen3.5-2B text hidden dimension, and in the 0.8B source checkpoint it maps $3072 \to 1024$ into the Qwen3.5-0.8B text hidden dimension. These targets do not match Small's 1024-dimensional or Nano's 768-dimensional Jina text backbone, so we keep LayerNorm and fc_vision_1 frozen but replace fc_vision_2 with a randomly initialized $4096 \to 1024$ layer for Small and a $3072 \to 768$ layer for Nano.
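The vision-projector path above can be sketched in a few lines. The sketch below uses toy dimensions rather than the real ones (in Small, the replaced fc_vision_2 maps 4096 to 1024), omits the learned LayerNorm scale/shift for brevity, and uses a tanh approximation of GELU:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Feature normalization over the last dimension (no learned scale/shift shown).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def vision_projector(patches, W1, b1, W2, b2):
    """patches: (N, 4, d_vit) groups of four neighboring ViT patch tokens.
    Returns (N, d_text): LayerNorm -> 2x2 merge (concat) -> fc1 -> GELU -> fc2."""
    m = layer_norm(patches).reshape(patches.shape[0], -1)  # concat 4 normalized tokens
    z = gelu(m @ W1.T + b1)                                # frozen fc_vision_1
    return z @ W2.T + b2                                   # retrained fc_vision_2

# Toy dimensions for illustration only.
rng = np.random.default_rng(0)
d_vit, d_mid, d_text, N = 8, 16, 12, 3
patches = rng.standard_normal((N, 4, d_vit))
W1, b1 = rng.standard_normal((d_mid, 4 * d_vit)) * 0.1, np.zeros(d_mid)
W2, b2 = rng.standard_normal((d_text, d_mid)) * 0.1, np.zeros(d_text)
h = vision_projector(patches, W1, b1, W2, b2)
assert h.shape == (N, d_text)
```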
Let $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_K] \in \mathbb{R}^{K \times 1280}$ denote the frozen Qwen2.5-Omni audio encoder states for an input with $K$ audio tokens. Each audio token is independently projected into the Jina text hidden dimension by fc_audio:

$$\mathbf{h}^{(i)}_{\text{aud}} = \ell_{W_{\text{aud}},\mathbf{b}_{\text{aud}}}(\mathbf{a}_i), \qquad i = 1, \ldots, K,$$

where $W_{\text{aud}} \in \mathbb{R}^{d_{\text{text}} \times 1280}$ and $d_{\text{text}} \in \{1024, 768\}$ for Small and Nano.
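Since fc_audio is a single affine layer applied token-wise, the whole audio path reduces to one matrix product. The sketch below uses the Small dimensions (1280 to 1024) with random placeholder weights, not the trained fc_audio:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d_text = 5, 1024
A = rng.standard_normal((K, 1280))            # frozen Qwen2.5-Omni audio states
W_aud = rng.standard_normal((d_text, 1280)) * 0.01
b_aud = np.zeros(d_text)
# h_aud^(i) = W_aud a_i + b_aud, for all i at once:
H_aud = A @ W_aud.T + b_aud
assert H_aud.shape == (K, d_text)
```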
### 3.2. Input Sequence Construction
Each input is serialized as one sequence of tokens. Text remains ordinary text tokens; non-text modalities are represented by placeholder runs inside modality delimiters. An image is encoded as

$$\texttt{<|vision\_start|>}\;\underbrace{\texttt{<|image\_pad|>} \times N}_{\text{visual slots}}\;\texttt{<|vision\_end|>}$$

with $N$ visual slots. An audio input is encoded as

$$\texttt{<|audio\_start|>}\;\underbrace{\texttt{<|audio\_pad|>} \times K}_{\text{audio slots}}\;\texttt{<|audio\_end|>}$$

with $K$ audio slots. A video is a concatenation of one visual segment per sampled frame:

$$\big\|_{f=1}^{F}\left(\texttt{<|vision\_start|>}\;\underbrace{\texttt{<|video\_pad|>} \times S_f}_{\text{frame } f \text{ slots}}\;\texttt{<|vision\_end|>}\right),$$

where $\|$ denotes sequence concatenation. If a video contains an audio track, the extracted audio segment precedes the frame sequence:

$$\mathbf{s}_{\text{aud}} \,\|\, \mathbf{s}_{\text{vid}}.$$

Here, $\mathbf{s}_{\text{aud}}$ is the audio sequence above and $\mathbf{s}_{\text{vid}}$ is the video-frame sequence. For mixed-modality inputs, text spans and modality segments are concatenated in document order.
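The serialization rules above can be written directly as string templates. The delimiter strings follow this section; the helper names are illustrative, and the slot counts N, K, and S_f would come from the encoders:

```python
def image_segment(n: int) -> str:
    return "<|vision_start|>" + "<|image_pad|>" * n + "<|vision_end|>"

def audio_segment(k: int) -> str:
    return "<|audio_start|>" + "<|audio_pad|>" * k + "<|audio_end|>"

def video_segment(slots_per_frame) -> str:
    # One visual segment per sampled frame, concatenated in order.
    return "".join(
        "<|vision_start|>" + "<|video_pad|>" * s + "<|vision_end|>"
        for s in slots_per_frame
    )

def video_with_audio(k: int, slots_per_frame) -> str:
    # The extracted audio track precedes the frame sequence.
    return audio_segment(k) + video_segment(slots_per_frame)

# Mixed-modality inputs concatenate text spans and segments in document order:
seq = "caption: " + image_segment(2)
assert seq == "caption: <|vision_start|><|image_pad|><|image_pad|><|vision_end|>"
```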
### 3.3. Trainable Parameters
The trainable set is fc_vision_2, fc_audio, and the modality-delimiter embeddings. jina-embeddings-v5-omni-small learns the vision and audio start/end delimiter embeddings used in Section [3.2](https://arxiv.org/html/2605.08384#S3.SS2); jina-embeddings-v5-omni-nano learns only the audio start/end delimiter embeddings. The image, video, and audio placeholder positions are overwritten by projected encoder features rather than learned as standalone token embeddings. Projector and delimiter-token training is run separately for retrieval, text-matching, clustering, and classification, while the text transformer, encoder towers, LayerNorm/fc_vision_1 vision-projector weights, and inherited LoRA adapters stay frozen. The base package stores four such task-specific sets alongside the inherited LoRA adapters.
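A back-of-envelope check of the reported 0.35% trainable fraction for the Small model, using the layer shapes stated in Section 3.1; the delimiter-embedding term is an estimate (four 1024-dimensional vectors), and the 1.57B total parameter count is the Small figure from Table 3:

```python
# Trainable set for Small: fc_vision_2 + fc_audio + delimiter embeddings.
d_text = 1024
fc_vision_2 = 4096 * d_text + d_text   # replaced projection (weights + bias)
fc_audio = 1280 * d_text + d_text      # inserted audio projection (weights + bias)
delimiters = 4 * d_text                # vision/audio start+end embeddings (estimate)
trainable = fc_vision_2 + fc_audio + delimiters
total = 1.57e9                         # Small's parameter count from Table 3
frac = trainable / total               # ~0.0035, i.e. roughly 0.35%
assert 0.003 < frac < 0.004
```

The arithmetic lands at about 5.5M trainable parameters out of 1.57B, consistent with the 0.35% figure quoted in the abstract.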
### 3.4. Dynamic Weight Loading
Jina Embeddings v5 Text already uses dynamic adapter selection to route retrieval, classification, clustering, and text-matching inputs through the corresponding task adapter. We extend the same task-selection mechanism to the multimodal weights: the selected task variant determines which LoRA adapter, fc_vision_2, fc_audio, and learned special text-token embeddings are loaded or activated. The task-specific projector and delimiter-token weights therefore follow the same task-specific variation as Jina Embeddings v5 Text. Separately, the model exposes a modality attribute that controls which frozen modality towers are instantiated: text-only loading omits both vision and audio towers, vision-only loading omits the audio tower and fc_audio, audio-only loading omits the vision tower and vision projector, and omni loading keeps both vision and audio towers.
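A minimal sketch of this dispatch logic, with hypothetical component names (the released checkpoints may organize their weights differently):

```python
# Each task variant bundles its inherited LoRA adapter, fc_vision_2, fc_audio,
# and delimiter embeddings; the modality attribute controls which frozen towers
# are instantiated at load time.
TASKS = {"retrieval", "text-matching", "clustering", "classification"}

def components_to_load(task: str, modality: str = "omni") -> dict:
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    load_vision = modality in ("vision", "omni")
    load_audio = modality in ("audio", "omni")
    return {
        "lora_adapter": f"lora_{task}",                           # inherited, frozen
        "fc_vision_2": f"fc_vision_2_{task}" if load_vision else None,
        "fc_audio": f"fc_audio_{task}" if load_audio else None,
        "vision_tower": load_vision,                              # frozen ViT
        "audio_tower": load_audio,                                # frozen audio encoder
    }

cfg = components_to_load("retrieval", "vision")
assert cfg["vision_tower"] and not cfg["audio_tower"] and cfg["fc_audio"] is None
```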
## 4. Training
Projector training uses bidirectional in-batch InfoNCE with Matryoshka representation learning. For a batch of $B$ paired examples $\{(\ell_i, r_i)\}_{i=1}^{B}$, let $\mathbf{u}_i$ and $\mathbf{v}_i$ be the left and right embeddings, and let $\mathbf{u}_{i,1:k}$ denote the first $k$ dimensions. With temperature $\tau = 0.02$,
$$s_{ij}^{(k)} = \frac{\cos(\mathbf{u}_{i,1:k}, \mathbf{v}_{j,1:k})}{\tau}, \qquad
p_{\ell\to r}^{(k)}(j|i) = \frac{\exp(s_{ij}^{(k)})}{\sum_{m=1}^{B}\exp(s_{im}^{(k)})}, \qquad
p_{r\to\ell}^{(k)}(j|i) = \frac{\exp(s_{ji}^{(k)})}{\sum_{m=1}^{B}\exp(s_{mi}^{(k)})}.$$

$$\mathcal{L}_{\mathrm{NCE}}^{(k)} = -\frac{1}{2B}\sum_{i=1}^{B}\left[\log p_{\ell\to r}^{(k)}(i|i) + \log p_{r\to\ell}^{(k)}(i|i)\right].$$

The training loss sums this term over Matryoshka prefix dimensions,
$$\mathcal{L} = \sum_{k\in\mathcal{K}} \mathcal{L}_{\mathrm{NCE}}^{(k)}, \qquad
\mathcal{K}_{\mathrm{Small}} = \{32, 64, 128, 256, 512, 768, 1024\}, \qquad
\mathcal{K}_{\mathrm{Nano}} = \{32, 64, 128, 256, 512, 768\}.$$
We use the AdamW optimizer (Loshchilov and Hutter, [2019](https://arxiv.org/html/2605.08384#bib.bib24)) with $\beta_1 = 0.9$, $\beta_2 = 0.999$, weight decay 0.01, and global gradient clipping at $\lVert\nabla\rVert_2 \leq 1$. The learning rate is $2\cdot 10^{-4}$ with 500 linear warmup steps. Training uses bf16 mixed precision and distributed data parallelism across 4 NVIDIA H100 GPUs, with a global batch size of 256 paired examples. For each model size, projector training is run separately for the retrieval, classification, clustering, and text-matching variants. Each run uses the corresponding frozen LoRA adapter inherited from Jina Embeddings v5 Text and trains the task-specific fc_vision_2/fc_audio projector weights plus the modality-delimiter token embeddings defined in Section [3.3](https://arxiv.org/html/2605.08384#S3.SS3). The same source mixture is reused across these task-specific projector runs, and each run is trained for 15,000 optimizer steps. Each batch contains examples from one source dataset sampled by mixture weight. Figure [3](https://arxiv.org/html/2605.08384#S4.F3) summarizes the shared projector-training mixture by token share across semantic data types. The mixture is weighted toward text-rich and complex images such as scans and diagrams, matching practical enterprise search and RAG systems that operate over real-world multimodal documents whose layout, images, and OCR/parsing stages affect retrieval quality (Lewis et al., [2020](https://arxiv.org/html/2605.08384#bib.bib19); Yu et al., [2025](https://arxiv.org/html/2605.08384#bib.bib38)).
Figure 3. Distribution of input *tokens* across semantic data types, averaged over the four task-specific checkpoints. (a) Image token share: natural photos 35.5%, medical imagery 30.3%, documents & OCR 23.7%, product catalog 5.3%, charts & diagrams 3.6%, UI & screenshots 1.6%. (b) Audio token share: music 55.0%, environmental sounds 25.5%, English speech 14.2%, multilingual speech 3.1%, animal sounds 1.9%, emotional speech 0.2%.
## 5. Evaluation
We describe each evaluation suite by the types of tasks it covers:
- Images: The Massive Image Embedding Benchmark (MIEB) (Xiao et al., [2025b](https://arxiv.org/html/2605.08384#bib.bib37)) covers classification, clustering, visual semantic textual similarity (STS), retrieval, document retrieval, compositional reasoning, and vision-centric tasks.
- Video: The Massive Multimodal Embedding Benchmark (MMEB) (Jiang et al., [2025](https://arxiv.org/html/2605.08384#bib.bib14)) provides a video evaluation suite, MMEB-Video, covering classification, VQA, retrieval, and moment-retrieval sub-tasks.
- Audio: The Massive Audio Embedding Benchmark (MAEB) (El Assadi et al., [2026](https://arxiv.org/html/2605.08384#bib.bib8)) covers audio–text and audio-centric embedding quality, grouped by task type (retrieval, classification, clustering, text-matching).
- Text: The Massive Multilingual Text Embedding Benchmark (MMTEB) (Enevoldsen et al., [2025](https://arxiv.org/html/2605.08384#bib.bib10)) evaluates text-only embedding quality across retrieval, classification, clustering, semantic textual similarity, reranking, and pair-classification tasks.
- Documents: We report ViDoRe (Macé et al., [2025](https://arxiv.org/html/2605.08384#bib.bib25)) page-level retrieval, where embeddings must capture fine layout and small text.
For text, we report the published MMTEB scores for Jina Embeddings v5 Text, since its behavior is identical to jina-embeddings-v5-omni for text inputs (Akram et al., [2026](https://arxiv.org/html/2605.08384#bib.bib2)).
Our baselines for comparison consist of open-weight omni-style models with support for the same media types: LanguageBind, Omni-Embed-Nemotron-3B, LCO-Embedding-Omni-3B, and LCO-Embedding-Omni-7B. We also include some task-matched specialized models: CLIP/SigLIP-style and VLM-derived embedders for vision, Whisper/CLAP-style embedders for audio, and VLM/video embedding models for video. Parameter counts are task-path specific: summaries for omni-style models count all compared modalities, while modality-specific rows count only the encoders needed for that task.
Table 1. Open-weight omni-style model scores on selected evaluation subsets. Text uses MMTEB; Image, Video, and Audio use aggregate MIEB, MMEB-Video subset, and MAEB scores, respectively. (MMEB-Video subset: Breakfast, MSR-VTT, EgoSchema, HMDB51, UCF101, MSVD, SmthSmthV2, DiDeMo, and K700.) Params count the loaded parameters needed for text, image, video, and audio requests; LanguageBind counts one shared language encoder plus the Image, Video_FT, and Audio_FT modality paths, not duplicate text copies shipped across the separate checkpoints. Avg averages the displayed numeric columns.
Table 2. Document-retrieval scores on the ViDoRe-in-MIEB subset. (\*Text+image path parameters for document retrieval; audio/video encoders are not counted.) Subset tasks: DocVQA, InfoVQA, TabFQuAD, TAT-DQA, ArxivQA, ShiftProject, SyntheticDocQA-AI, SyntheticDocQA-Energy, SyntheticDocQA-HealthcareIndustry, and SyntheticDocQA-GovernmentReports.
### 5.1. Results
Table [1](https://arxiv.org/html/2605.08384#S5.T1) shows that jina-embeddings-v5-omni-small has the strongest text-only performance and the best overall score among models below 5B parameters. Its 53.93 four-modality average is slightly above LCO-Embedding-Omni-3B (53.83) and below only the larger LCO-Embedding-Omni-7B score of 54.43, among comparable omni-style models. The same table also contains comparisons by modality. jina-embeddings-v5-omni-small is very strong on text and competitive on images and audio, but video performance lags significantly compared to the baseline models.
Table [2](https://arxiv.org/html/2605.08384#S5.T2) shows that both jina-embeddings-v5-omni-nano and jina-embeddings-v5-omni-small have strong visual document retrieval performance. jina-embeddings-v5-omni-small scores 79.08 with 0.92B active text+image-path parameters, above LCO-Embedding-Omni-3B (78.24) and close to LCO-Embedding-Omni-7B (80.32). jina-embeddings-v5-omni-nano scores 70.05 with 0.31B active parameters, competitive for its size and substantially above LanguageBind on the ViDoRe MIEB subset.
Table [3](https://arxiv.org/html/2605.08384#S5.T3) gives a detailed breakdown across multiple benchmarks. The strongest jina-embeddings-v5-omni-small performances are for image classification, image clustering, visual STS, multilingual image retrieval, and audio classification, while generic image retrieval, MMEB-Video, and audio clustering remain weaker.
Figures [4](https://arxiv.org/html/2605.08384#S5.F4) and [5](https://arxiv.org/html/2605.08384#S5.F5) show relative performance per language, compared to the average of the baseline models. Color indicates deviation from the five-model per-language mean for image-language and audio retrieval, respectively. Figure [4](https://arxiv.org/html/2605.08384#S5.F4) highlights the relatively strong performance of jina-embeddings-v5-omni-small on languages other than English, while Figure [5](https://arxiv.org/html/2605.08384#S5.F5) does the same for audio performance.
Table 3. Main benchmark results. Bold numeric cells mark the row winner among jina-embeddings-v5-omni-nano, jina-embeddings-v5-omni-small, and the strongest open-weight baseline model; bold row labels are benchmark or slice aggregates, and indented rows are task-type averages. The "Strongest open-weight baseline" column is an orientation point, not a unified controlled ladder.

| Benchmark / task type | #Tasks | Nano (0.95 B) | Small (1.57 B) | Strongest open-weight baseline | Params (B) | Score |
|---|---|---|---|---|---|---|
| **MIEB Light (Image)** | 50 | 42.38 | 53.41 | LCO-Embedding-Omni-3B | 4.07 | 61.63 |
| Image classification | 15 | 44.18 | 63.96 | LCO-Embedding-Omni-3B | 4.07 | 59.07 |
| Compositional / vision QA | 11 | 35.88 | 40.79 | LCO-Embedding-Omni-3B | 4.07 | 52.00 |
| Image clustering | 2 | 50.18 | 81.87 | LCO-Embedding-Omni-3B | 4.07 | 73.19 |
| Visual STS | 4 | 63.83 | 74.17 | royokong/e5-v | 8.36 | 63.73 |
| Retrieval | 12 | 21.72 | 29.89 | LCO-Embedding-Omni-3B | 4.07 | 83.44 |
| Document retrieval | 6 | 74.18 | 73.86 | LCO-Embedding-Omni-3B | 4.07 | 72.99 |
| **MIEB (Image)** | 119 | 46.41 | 60.17 | siglip-so400m-patch14-384 | 0.88 | 60.69 |
| Image classification | 44 | 53.89 | 68.55 | LCO-Embedding-Omni-3B | 4.07 | 64.30 |
| Compositional / vision QA | 13 | 39.13 | 44.23 | LCO-Embedding-Omni-3B | 4.07 | 53.40 |
| Image clustering | 5 | 66.65 | 84.57 | LCO-Embedding-Omni-3B | 4.07 | 83.24 |
| Visual STS | 9 | 68.88 | 78.04 | LCO-Embedding-Omni-3B | 4.07 | 79.62 |
| Retrieval | 44 | 23.58 | 38.53 | LCO-Embedding-Omni-3B | 4.07 | 46.29 |
| Document retrieval | 10 | 70.05 | 79.08 | Omni-Embed-Nemotron-3B | 4.70 | 85.64 |
| **MIEB Multilingual only (Image)** | 5 | 41.16 | 65.55 | LCO-Embedding-Omni-3B | 4.07 | 69.04 |
| Visual STS | 2 | 52.65 | 65.05 | LCO-Embedding-Omni-3B | 4.07 | 79.62 |
| Retrieval | 3 | 33.49 | 65.88 | LCO-Embedding-Omni-3B | 4.07 | 61.99 |
| **MMEB-Video (Video)** | 18 | 29.73 | 39.83 | Qwen3-VL-Embedding-8B | 8.14 | 67.15 |
| V-CLS (classification) | 5 | 27.85 | 42.73 | Qwen3-VL-Embedding-8B | 8.14 | 78.39 |
| V-QA (question answering) | 5 | 39.03 | 44.52 | WeMM-Embedding-8B | 8.77 | 71.66 |
| V-RET (retrieval) | 5 | 14.33 | 27.82 | Qwen3-VL-Embedding-8B | 8.14 | 58.73 |
| V-MRET (moment retrieval) | 3 | 43.02 | 47.20 | Qwen3-VL-Embedding-8B | 8.14 | 56.09 |
| **MAEB (Audio)** | 30 | 42.40 | 50.77 | LCO-Embedding-Omni-7B | 8.93 | 52.37 |
| Retrieval / reranking | 10 | 39.24 | 53.56 | LCO-Embedding-Omni-7B | 8.93 | 61.67 |
| Classification / zero-shot | 14 | 49.25 | 55.89 | LCO-Embedding-Omni-7B | 8.93 | 53.39 |
| Text matching | 3 | 56.90 | 62.40 | LCO-Embedding-Omni-7B | 8.93 | 67.30 |
| Clustering | 3 | 6.44 | 5.99 | clap-htsat-fused | 0.15 | 22.74 |
MIEB rows exclude RP2kI2IRetrieval\*, SOPI2IRetrieval\*, SciMMIRI2TRetrieval\*, SciMMIRT2IRetrieval\*, and CLEVRCountZeroShot\*; \* denotes MIEB tasks removed because of train–test contamination. MMEB-Video uses the full 18-task suite, including MomentSeeker.
Figure 4. XM3600 image-language comparison. Tiles show jina-v5-omni-small; color is deviation from a five-model per-language mean (compared with Nano, LanguageBind, LCO-3B, and Nem-3B; LCO-7B is discussed by aggregate score).

Figure 5. Per-language audio retrieval. Tiles show jina-v5-omni-small on shared CommonVoiceMini21/FLEURS languages; color is deviation from the mean of the baseline models (Nano, LCO-3B, Nem-3B, and LanguageBind Audio).
## 6. Ablation Studies
The architecture described in Section [3](https://arxiv.org/html/2605.08384#S3) rests on two design choices: which projector layers to train and whether to update an encoder. This section uses ablation studies to investigate those choices for the projector-training recipe.
### 6.1. Trainable Parameters
Runs in this subsection start from jina-embeddings-v5-omni-small-retrieval, use global batch 128 (32 per rank × 4 H100), and run for 5,000 optimizer steps. Image ablations use a fast MIEB subset: CIRR-IT2I and NIGHTS-I2I retrieval. Audio ablations use an 8-task MAEB subset. For these experiments, the primary trainable projector is randomly initialized at load time: fc_vision_2 for vision runs and fc_audio for audio runs. The remaining layers (encoder, LayerNorm, fc_vision_1) retain their pretrained initialization values.
#### 6.1.1. Vision
We tested which parts of the Qwen3.5 vision stack to train, keeping the rest frozen, evaluating five configurations.
- I: fc_vision_2 only, lr 2×10⁻⁴ (our configuration).
- II: fc_vision_1 + fc_vision_2, lr 2×10⁻⁴; fc_vision_1 stays at the Qwen3.5 initialization, fc_vision_2 is reset.
- III: fc_vision_1 + fc_vision_2 + vision encoder, lr 1×10⁻⁵ (dropped 20× because the encoder is unfrozen).
- IV: I, then fc_vision_1 + fc_vision_2, continuing from the stage-I checkpoint.
- V: I, then fc_vision_1 + fc_vision_2 + vision encoder, continuing from the stage-I checkpoint.
Runs I–III are single-stage ablations from the same reset fc_vision_2. Runs IV and V are two-stage continuations that first train run I and then unfreeze additional layers for a second 5,000-step stage.
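The staging logic behind the two-stage runs can be written down as a small schedule. This is a sketch with illustrative layer names; the rule it encodes, dropping the learning rate 20× whenever the encoder is unfrozen, is the one stated above:

```python
# Hypothetical two-stage schedule mirroring runs IV and V: stage one trains
# only the reset projector; stage two resumes from that checkpoint with more
# layers unfrozen, at a 20x lower learning rate if the encoder joins.
def make_schedule(unfreeze_encoder):
    stage_one = {"trainable": ["fc_vision_2"], "lr": 2e-4, "steps": 5_000}
    stage_two_layers = ["fc_vision_1", "fc_vision_2"]
    lr = 2e-4
    if unfreeze_encoder:
        stage_two_layers.append("vision_encoder")
        lr = 1e-5  # 20x lower once the encoder is unfrozen
    stage_two = {"trainable": stage_two_layers, "lr": lr, "steps": 5_000}
    return [stage_one, stage_two]

run_iv = make_schedule(unfreeze_encoder=False)
run_v = make_schedule(unfreeze_encoder=True)
```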
Figure 6. Vision ablation tests on CIRR-IT2I and NIGHTS-I2I. PRO is fc_vision_2, PRO1/2 is fc_vision_1 + fc_vision_2, ViT is the vision encoder, and V adds only 0.001 over I.
##### Result:
Figure [6](https://arxiv.org/html/2605.08384#S6.F6) displays the results of these tests. The fc_vision_2-only recipe (I) is sufficient: it reaches 0.158, while training fc_vision_1 from the start (II) ends slightly lower at 0.153. Unfreezing the encoder from step 0 (III) is clearly harmful, ending at 0.079. The two-stage variants test whether I should be followed by a broader continuation stage. Continuing with fc_vision_1 + fc_vision_2 (IV) does not improve the checkpoint, and the broader continuation with the encoder unfrozen (V) reaches only 0.159, an absolute gain of 0.001 over I on this 2-task subset. That gain is too small to justify a production recipe with an additional continuation stage and extra task-specific adapter/projector artifacts for all four variants of each model size, so the released configuration keeps the simpler frozen-tower choice: train fc_vision_2 and leave fc_vision_1, the vision encoder, and inherited LoRA adapters fixed.
#### 6.1.2. Audio
We then tested which parts of the Qwen2.5-Omni audio stack to train, keeping the rest frozen, evaluating three configurations.
- I: fc_audio only, lr 2×10⁻⁴ (our configuration).
- II: fc_audio + audio encoder, lr 1×10⁻⁵, starting from the reset projector.
- III: I, then fc_audio + audio encoder, continuing from the final I checkpoint, lr 1×10⁻⁵.
Runs I and II are single-stage ablations from the same reset fc_audio. Run III is a two-stage continuation that first trains run I and then unfreezes the audio encoder for a second 5,000-step stage.
Figure 7. Audio ablation tests on UrbanSound8K, CommonVoiceMini21, MACS, GigaSpeech, SpokenSQuAD, Clotho, JamAlt Artist, and JamAlt Lyric. PRO is fc_audio, AUD is the audio encoder, and III adds about 0.022 over I.
##### Result:
Figure [7](https://arxiv.org/html/2605.08384#S6.F7) displays the results of these tests. The fc_audio-only recipe (I) is sufficient for this budget: it reaches 0.398, while unfreezing the audio encoder from step 0 (II) ends lower at 0.367. The two-stage variant tests whether I should be followed by a broader continuation stage. Continuing with fc_audio + audio encoder (III) reaches 0.419, an absolute gain of 0.022 over I. We therefore keep the released recipe frozen for simplicity, while treating audio-encoder adaptation as a promising future training stage.
### 6.2. Matryoshka Preservation Across Modalities
Figure 8. Matryoshka prefix tests across modalities. Curves show mean nDCG@10 versus truncation dimension for text, image, audio, and video retrieval; line style indicates modality and color shade indicates model size.
Figure [8](https://arxiv.org/html/2605.08384#S6.F8) shows Matryoshka performance under embedding truncation. Image embeddings behave similarly to text ones: both jina-embeddings-v5-omni-small and jina-embeddings-v5-omni-nano lose roughly 0.18–0.21 nDCG@10 when truncated to 32 dimensions. Audio also preserves most of its score at 256 dimensions, while video degrades much more heavily at small dimensions, indicating weaker Matryoshka preservation for video embeddings.
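Matryoshka truncation itself is simple: keep only the first k dimensions of each embedding and re-normalize before cosine scoring. A minimal sketch with random stand-in vectors (the 1024-dimensional width here is an assumption for illustration):

```python
import numpy as np

def truncate(emb, k):
    """Keep the first k dimensions and re-normalize to unit length."""
    prefix = emb[..., :k]
    return prefix / np.linalg.norm(prefix, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
query = rng.normal(size=(1, 1024))   # stand-in query embedding
docs = rng.normal(size=(8, 1024))    # stand-in document embeddings

# Cosine scores at full width and at two Matryoshka prefixes.
scores_full = truncate(query, 1024) @ truncate(docs, 1024).T
scores_256 = truncate(query, 256) @ truncate(docs, 256).T
scores_32 = truncate(query, 32) @ truncate(docs, 32).T
```

The prefix structure is what lets an index store short vectors while staying comparable to full-width ones.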
### 6.3. Training Efficiency
This ablation measures the efficiency gained by updating only the projector path rather than training the full model. Table [4](https://arxiv.org/html/2605.08384#S6.T4) shows that projector training makes vision runs 1.8× faster and audio runs 3.2–3.9× faster at the 15k-step budget, with lower peak GPU memory in every case.
Table 4. Training throughput and peak GPU memory.
## 7. Conclusion
We introduce frozen-encoder model composition, a novel approach to constructing multimodal embedding models by connecting frozen pre-trained modality-specific encoders directly to a frozen text embedding model via compact, easily trained projectors. We also present the result of this research, the jina-embeddings-v5-omni model suite. These models add vision and audio to the Jina Embeddings v5 Text models, yielding a competitive set of models for broad cross-modality applications. Using this recipe, text-only embedding models that were never trained on vision or audio can be extended to photos, documents, video, speech, music, and sounds by training a single projector layer per modality while preserving text-only performance.
jina-embeddings-v5-omni-small is the best-performing open-weight embedding model below 2B parameters that supports text, audio, images, and video. Against a baseline of comparable models, including modality-specific and VLM-derived embedders, it is particularly strong on visual document retrieval. jina-embeddings-v5-omni-small and jina-embeddings-v5-omni-nano extend completely different text embedding models with different backbone architectures, suggesting that frozen-encoder composition is an extensible strategy with broad application outside the jina-embeddings-v5-omni suite and for additional modalities, a potential subject for future research.
The ablations suggest that projector-only alignment can serve as a compatibility-preserving initialization for rich multimodal training. Future work will investigate the choice of non-text encoders, which this paper leaves largely unexplored, as well as training options under different conditions, such as jointly training projectors for multiple modalities. We also note the strong performance of jina-embeddings-v5-omni on temporal reasoning and moment retrieval, but poor performance on other video tasks; we hope to improve this area in future models.
## References
- Akram et al. (2026) Mohammad Kalim Akram, Saba Sturua, Nastia Havriushenko, Quentin Herreros, Michael Günther, Maximilian Werk, and Han Xiao. 2026. jina-embeddings-v5-text: Task-Targeted Embedding Distillation. arXiv:2602.15547 [cs.CL]. [https://arxiv.org/abs/2602.15547](https://arxiv.org/abs/2602.15547)
- Alibaba Tongyi Lab (2024) Alibaba Tongyi Lab. 2024. gte-Qwen2: General Text Embeddings Based on Qwen2. Hugging Face model collection. [https://huggingface.co/collections/Alibaba-NLP/gte-qwen2](https://huggingface.co/collections/Alibaba-NLP/gte-qwen2)
- Bai et al. (2025) Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan Liu, Dunjie Lu, Ruilin Luo, Chenxu Lv, Rui Men, Lingchen Meng, Xuancheng Ren, Xingzhang Ren, Sibo Song, Yuchong Sun, Jun Tang, Jianhong Tu, Jianqiang Tu, Jianqiang Wan, Peng Wang, Pengfei Wang, Qiuyue Wang, Yuxuan Wang, Tianbao Xie, Yiheng Xu, Haiyang Xu, Jin Xu, Zhibo Yang, Mingkun Yang, Jianxin Yang, An Yang, Bowen Yu, Fei Zhang, Hang Zhang, Xi Zhang, Bo Zheng, Humen Zhong, Jingren Zhou, Fan Zhou, Jing Zhou, Yuanzhi Zhu, and Ke Zhu. 2025. Qwen3-VL Technical Report. arXiv:2511.21631 [cs.CV]. [https://arxiv.org/abs/2511.21631](https://arxiv.org/abs/2511.21631)
- Chen et al. (2026) Haonan Chen, Sicheng Gao, Radu Timofte, Tetsuya Sakai, and Zhicheng Dou. 2026. e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings. arXiv:2601.03666 [cs.CL]. [https://arxiv.org/abs/2601.03666](https://arxiv.org/abs/2601.03666)
- Chen et al. (2025) Yitong Chen, Lingchen Meng, Wujian Peng, Zuxuan Wu, and Yu-Gang Jiang. 2025. CoMP: Continual Multimodal Pre-training for Vision Foundation Models. arXiv:2503.18931 [cs.CV]. [https://arxiv.org/abs/2503.18931](https://arxiv.org/abs/2503.18931)
- Chu et al. (2025) Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Haojie Zhang, Zhijie Gu, Yuxuan Zhou, Jingren Zhou, Junyang Lin, and Chang Zhou. 2025. Qwen2.5-Omni Technical Report. arXiv:2503.20215 [cs.CL]. [https://arxiv.org/abs/2503.20215](https://arxiv.org/abs/2503.20215)
- El Assadi et al. (2026) Adnan El Assadi, Isaac Chung, Chenghao Xiao, Roman Solomatin, Animesh Jha, Rahul Chand, Silky Singh, Kaitlyn Wang, Ali Sartaz Khan, Marc Moussa Nasser, Sufen Fong, Pengfei He, Alan Xiao, Ayush Sunil Munot, Aditya Shrivastava, Artem Gazizov, Niklas Muennighoff, and Kenneth Enevoldsen. 2026. MAEB: Massive Audio Embedding Benchmark. arXiv:2602.16008 [cs.SD]. [https://arxiv.org/abs/2602.16008](https://arxiv.org/abs/2602.16008)
- Elizalde et al. (2023) Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. 2023. CLAP: Learning Audio Concepts From Natural Language Supervision. In *IEEE International Conference on Acoustics, Speech and Signal Processing*. 1–5.
- Enevoldsen et al. (2025) Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemiński, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Gabriel Sequeira, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Çağatan, Akash Kundu, Martin Bernstorff, Shitao Xiao, Akshita Sukhlecha, Bhavish Pahwa, Rafał Poświata, Kranthi Kiran GV, Shawon Ashraf, Daniel Auras, Björn Plüster, Jan Philipp Harries, Loïc Magne, Isabelle Mohr, Mariya Hendriksen, Dawei Zhu, Hippolyte Gisserot-Boukhlef, Tom Aarsen, Jan Kostkan, Konrad Wojtasik, Taemin Lee, Marek Šuppa, Crystina Zhang, Roberta Rocca, Mohammed Hamdy, Andrianos Michail, John Yang, Manuel Faysse, Aleksei Vatolin, Nandan Thakur, Manan Dey, Dipam Vasani, Pranjal Chitale, Simone Tedeschi, Nguyen Tai, Artem Snegirev, Michael Günther, Mengzhou Xia, Weijia Shi, Xing Han Lù, Jordan Clive, Gayatri Krishnakumar, Anna Maksimova, Silvan Wehrli, Maria Tikhonova, Henil Panchal, Aleksandr Abramov, Malte Ostendorff, Zheng Liu, Simon Clematide, Lester James Miranda, Alena Fenogenova, Guangyu Song, Ruqiya Bin Safi, Wen-Ding Li, Alessia Borghini, Federico Cassano, Hongjin Su, Jimmy Lin, Howard Yen, Lasse Hansen, Sara Hooker, Chenghao Xiao, Vaibhav Adlakha, Orion Weller, Siva Reddy, and Niklas Muennighoff. 2025. MMTEB: Massive Multilingual Text Embedding Benchmark. arXiv:2502.13595 [cs.CL]. [https://arxiv.org/abs/2502.13595](https://arxiv.org/abs/2502.13595)
- Fang et al. (2023) Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. 2023. EVA-CLIP: Improved Training Techniques for CLIP at Scale. arXiv:2303.15389 [cs.CV]. [https://arxiv.org/abs/2303.15389](https://arxiv.org/abs/2303.15389)
- Girdhar et al. (2023) Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. 2023. ImageBind: One Embedding Space To Bind Them All. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*. 15180–15190.
- Jiang et al. (2024) Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, and Fuzhen Zhuang. 2024. E5-V: Universal Embeddings with Multimodal Large Language Models. arXiv:2407.12580 [cs.CL]. [https://arxiv.org/abs/2407.12580](https://arxiv.org/abs/2407.12580)
- Jiang et al. (2025) Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. 2025. MMEB: Massive Multi-discipline Multimodal Embedding Benchmark. arXiv:2410.05160 [cs.CV]. [https://arxiv.org/abs/2410.05160](https://arxiv.org/abs/2410.05160) Introduced with VLM2Vec.
- Koukounas et al. (2024a) Andreas Koukounas, Georgios Mastrapas, Sedigheh Eslami, Bo Wang, Mohammad Kalim Akram, Michael Günther, Isabelle Mohr, Saba Sturua, Nan Wang, and Han Xiao. 2024a. jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images. arXiv:2412.08802 [cs.CL]. [https://arxiv.org/abs/2412.08802](https://arxiv.org/abs/2412.08802)
- Koukounas et al. (2024b) Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, and Han Xiao. 2024b. Jina CLIP: Your CLIP Model Is Also Your Text Retriever. arXiv:2405.20204 [cs.CL]. [https://arxiv.org/abs/2405.20204](https://arxiv.org/abs/2405.20204)
- Kusupati et al. (2022) Aditya Kusupati, Ashish Bhatt, Matthew Wallingford, Aniruddha Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Jain, and Ali Farhadi. 2022. Matryoshka Representation Learning. In *Advances in Neural Information Processing Systems*.
- Lee et al. (2025) Chien Van Lee, Rajarshi Roy, Mengting Xu, Jonathan Raiman, Mohammad Shoeybi, and Bryan Catanzaro. 2025. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models. arXiv:2412.04252 [cs.CL]. [https://arxiv.org/abs/2412.04252](https://arxiv.org/abs/2412.04252)
- Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In *Advances in Neural Information Processing Systems*, Vol. 33. 9459–9474. [https://proceedings.neurips.cc/paper/2020/hash/6b493230-Abstract.html](https://proceedings.neurips.cc/paper/2020/hash/6b493230-Abstract.html)
- Li et al. (2023) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In *Proceedings of the International Conference on Machine Learning*, Vol. 202. PMLR, 19730–19742.
- Li et al. (2026) Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. 2026. Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking. arXiv:2601.04720 [cs.CL]. [https://arxiv.org/abs/2601.04720](https://arxiv.org/abs/2601.04720)
- Liang et al. (2022) Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Zou. 2022. Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning. In *Advances in Neural Information Processing Systems*, Vol. 35. Curran Associates, Inc., New Orleans, LA, USA, 17612–17625. arXiv:2203.02053 [cs.LG]. [https://arxiv.org/abs/2203.02053](https://arxiv.org/abs/2203.02053)
- Liu et al. (2023) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual Instruction Tuning. In *Advances in Neural Information Processing Systems*, Vol. 36.
- Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In *International Conference on Learning Representations*.
- Macé et al. (2025) Quentin Macé, António Loison, and Manuel Faysse. 2025. ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval. arXiv:2505.17166 [cs.IR]. [https://arxiv.org/abs/2505.17166](https://arxiv.org/abs/2505.17166)
- Nussbaum et al. (2024) Zach Nussbaum, Brandon Duderstadt, and Andriy Mulyar. 2024. Nomic Embed Vision: Expanding the Latent Space. arXiv:2406.18587 [cs.CV]. [https://arxiv.org/abs/2406.18587](https://arxiv.org/abs/2406.18587)
- Qwen Team (2026) Qwen Team. 2026. Qwen3.5: Towards Native Multimodal Agents. [https://qwen.ai/blog?id=qwen3.5](https://qwen.ai/blog?id=qwen3.5)
- Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In *International Conference on Machine Learning*, Vol. 139. PMLR, 8748–8763.
- Radford et al. (2023) Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust Speech Recognition via Large-Scale Weak Supervision. In *International Conference on Machine Learning*, Vol. 202. PMLR, 28492–28518.
- Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In *Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing*. Association for Computational Linguistics, 3982–3992.
- Shi et al. (2016) Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. 2016. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*. 1874–1883. [https://openaccess.thecvf.com/content_cvpr_2016/html/Shi_Real-Time_Single_Image_CVPR_2016_paper.html](https://openaccess.thecvf.com/content_cvpr_2016/html/Shi_Real-Time_Single_Image_CVPR_2016_paper.html)
- Tang et al. (2026) Changli Tang, Qinfan Xiao, Ke Mei, Tianyi Wang, Fengyun Rao, and Chao Zhang. 2026. WAVE: Learning Unified and Versatile Audio-Visual Embeddings with Multimodal LLM. In *International Conference on Learning Representations*. [https://openreview.net/forum?id=MiV3WXDYJb](https://openreview.net/forum?id=MiV3WXDYJb)
- Tschannen et al. (2025) Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. 2025. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. arXiv:2502.14786 [cs.CV]. [https://arxiv.org/abs/2502.14786](https://arxiv.org/abs/2502.14786)
- Wang et al. (2024b) Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2024b. Multilingual E5 Text Embeddings: A Technical Report. arXiv:2402.05672 [cs.CL]. [https://arxiv.org/abs/2402.05672](https://arxiv.org/abs/2402.05672)
- Wang et al. (2024a) Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. 2024a. Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. arXiv:2409.12191 [cs.CV]. [https://arxiv.org/abs/2409.12191](https://arxiv.org/abs/2409.12191)
- Xiao et al. (2025a) Chenghao Xiao, Hou Pong Chan, Hao Zhang, Weiwen Xu, Mahani Aljunied, and Yu Rong. 2025a. Scaling Language-Centric Omnimodal Representation Learning. arXiv:2510.11693 [cs.CL]. [https://arxiv.org/abs/2510.11693](https://arxiv.org/abs/2510.11693)
- Xiao et al. (2025b) Chenghao Xiao, Isaac Chung, Imene Kerboua, Jamie Stirling, Xin Zhang, Márton Kardos, Roman Solomatin, Noura Al Moubayed, Kenneth Enevoldsen, and Niklas Muennighoff. 2025b. MIEB: Massive Image Embedding Benchmark. arXiv:2504.10471 [cs.CV]. [https://arxiv.org/abs/2504.10471](https://arxiv.org/abs/2504.10471)
- Yu et al. (2025) Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, and Maosong Sun. 2025. VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents. In *International Conference on Learning Representations*. [https://openreview.net/forum?id=zG459X3Xge](https://openreview.net/forum?id=zG459X3Xge)
- Zhai et al. (2023) Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023. Sigmoid Loss for Language Image Pre-Training. In *Proceedings of the IEEE/CVF International Conference on Computer Vision*. 11975–11986.
- Zhai et al. (2022) Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. 2022. LiT: Zero-Shot Transfer With Locked-image Text Tuning. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*. 18123–18133.
- Zhang et al. (2025) Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. 2025. GME: Improving Universal Multimodal Retrieval by Multimodal LLMs. arXiv:2412.16855 [cs.CL]. [https://arxiv.org/abs/2412.16855](https://arxiv.org/abs/2412.16855) Includes gme-Qwen2-VL checkpoints.