Multimodal Embedding & Reranker Models with Sentence Transformers
Summary
Sentence Transformers v5.4 introduces support for multimodal embedding and reranking, allowing users to encode and compare text, images, audio, and video using a unified API.
Cached at: 05/08/26, 09:09 AM
Source: https://huggingface.co/blog/multimodal-sentence-transformers
Table of Contents
- What are Multimodal Models?
- Installation
- Multimodal Embedding Models
  - Loading a Model
  - Encoding Images
  - Cross-Modal Similarity
  - Encoding Queries and Documents
- Multimodal Reranker Models
  - Ranking Mixed-Modality Documents
  - Predicting Pair Scores
- Retrieve and Rerank
- Input Formats and Configuration
  - Supported Input Types
  - Checking Modality Support
  - Processor and Model kwargs
- Supported Models
  - Supported Multimodal Embedding Models
  - Supported Multimodal Reranker Models
  - Text-Only Reranker Models (also new in v5.4)
  - CLIP Models
- Additional Resources
  - Documentation
  - Training
  - Hugging Face Hub
  - Companion Blogposts
Sentence Transformers is a Python library for using and training embedding and reranker models for applications like retrieval augmented generation, semantic search, and more. With the v5.4 update, you can now encode and compare texts, images, audio, and videos using the same familiar API. In this blogpost, I’ll show you how to use these new multimodal capabilities for both embedding and reranking.
Multimodal embedding models map inputs from different modalities into a shared embedding space, while multimodal reranker models score the relevance of mixed-modality pairs. This opens up use cases like visual document retrieval, cross-modal search, and multimodal RAG pipelines.
If you want to train your own multimodal models, check out the companion blogpost: Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers.
What are Multimodal Models?
Traditional embedding models convert text into fixed-size vectors. Multimodal embedding models extend this by mapping inputs from different modalities (text, images, audio, or video) into a shared embedding space. This means you can compare a text query against image documents (or vice versa) using the same similarity functions you’re already familiar with.
Similarly, traditional reranker (Cross Encoder) models compute relevance scores between pairs of texts. Multimodal rerankers can score pairs where one or both elements are images, combined text-image documents, or other modalities.
For example, you can compare a text query against image documents, find video clips matching a description, or build RAG pipelines that work across modalities.
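To make the idea of a shared embedding space concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made-up toy values, not real model outputs; the only point is that one similarity function serves text and image vectors alike once they live in the same space.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values) standing in for embeddings of different modalities
text_query = [0.9, 0.1, 0.0]  # e.g. the text "a green car"
car_image = [0.8, 0.2, 0.1]   # e.g. a photo of a car
bee_image = [0.1, 0.9, 0.2]   # e.g. a photo of a bee

sim_car = cosine_similarity(text_query, car_image)
sim_bee = cosine_similarity(text_query, bee_image)
print(f"text vs car image: {sim_car:.3f}")
print(f"text vs bee image: {sim_bee:.3f}")
```

Real embeddings have hundreds or thousands of dimensions, but the comparison works the same way.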
Installation
Multimodal models require some extra dependencies. Install the extras for the modalities you need (see Installation for more details):
# For image support
pip install -U "sentence-transformers[image]"
# For audio support
pip install -U "sentence-transformers[audio]"
# For video support
pip install -U "sentence-transformers[video]"
# Mix and match as needed
pip install -U "sentence-transformers[image,video,train]"
VLM-based models like Qwen3-VL-2B require a GPU with at least ~8 GB of VRAM. For the 8B variants, expect ~20 GB. If you don’t have a local GPU, consider using a cloud GPU service or Google Colab. On CPU, these models will be extremely slow; text-only or CLIP models are better suited for CPU inference.
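As a rough sanity check on those VRAM figures, the sketch below estimates the memory taken by the model weights alone. It assumes nominal parameter counts of 2 billion and 8 billion and ignores activations, image tokens, and framework overhead, so treat the numbers as a lower bound rather than a requirement.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter):
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return num_parameters * bytes_per_parameter / 1e9

# Nominal parameter counts (assumption); actual checkpoints may differ somewhat
for name, params in [("2B model", 2e9), ("8B model", 8e9)]:
    bf16 = weight_memory_gb(params, 2)  # bfloat16/float16: 2 bytes per parameter
    fp32 = weight_memory_gb(params, 4)  # float32: 4 bytes per parameter
    print(f"{name}: {bf16:.0f} GB in bf16, {fp32:.0f} GB in fp32 (weights only)")
```

The gap between these weight-only estimates and the figures quoted above is plausibly what inference needs on top of the weights, which is also why loading in a 16-bit precision helps on smaller GPUs.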
Multimodal Embedding Models
Loading a Model
Loading a multimodal embedding model works exactly like loading a text-only model:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")
Some models might require a revision argument for now if the integration pull request for the model is still pending. Once it is merged, you’ll be able to load the model without specifying a revision, like above.
The model automatically detects which modalities it supports, so there’s nothing extra to configure. See Processor and Model kwargs if you want to control things like image resolution or model precision.
Encoding Images
With a multimodal model loaded, model.encode() accepts images alongside text. Images can be provided as URLs, local file paths, or PIL Image objects (see Supported Input Types for all accepted formats):
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")
# Encode images from URLs
img_embeddings = model.encode([
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
])
print(img_embeddings.shape)
# (2, 2048)
Cross-Modal Similarity
You can compute similarities between text embeddings and image embeddings, since the model maps both into the same space:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")
# Encode images
img_embeddings = model.encode([
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
])
# Encode text queries (one matching + one hard negative per image)
text_embeddings = model.encode([
"A green car parked in front of a yellow building",
"A red car driving on a highway",
"A bee on a pink flower",
"A wasp on a wooden table",
])
# Compute cross-modal similarities
similarities = model.similarity(text_embeddings, img_embeddings)
print(similarities)
# tensor([[0.5115, 0.1078],
# [0.1999, 0.1108],
# [0.1255, 0.6749],
# [0.1283, 0.2704]])
As expected, “A green car parked in front of a yellow building” is most similar to the car image (0.51), and “A bee on a pink flower” is most similar to the bee image (0.67). The hard negatives (“A red car driving on a highway”, “A wasp on a wooden table”) correctly receive lower scores.
You might notice that even the best matching scores (0.51, 0.67) aren’t very close to 1.0. This is due to the modality gap: embeddings from different modalities tend to cluster in separate regions of the space. Cross-modal similarities are typically lower than within-modal ones (e.g., text-to-text), but the relative ordering is preserved, so retrieval still works well.
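To see why the ordering matters more than the absolute values, the sketch below takes the similarity scores printed above and picks the best-matching text for each image. It uses plain Python lists instead of tensors, and the image URLs are shortened to file names for readability.

```python
# Similarity scores copied from the output above: rows are texts, columns are images
texts = [
    "A green car parked in front of a yellow building",
    "A red car driving on a highway",
    "A bee on a pink flower",
    "A wasp on a wooden table",
]
images = ["car.jpg", "bee.jpg"]
similarities = [
    [0.5115, 0.1078],
    [0.1999, 0.1108],
    [0.1255, 0.6749],
    [0.1283, 0.2704],
]

# For each image (column), find the text (row) with the highest score
best_text_per_image = []
for col, image in enumerate(images):
    column_scores = [row[col] for row in similarities]
    best_row = max(range(len(texts)), key=lambda r: column_scores[r])
    best_text_per_image.append(best_row)
    print(f"{image} -> {texts[best_row]!r} ({column_scores[best_row]:.4f})")
```

Even though no score comes close to 1.0, each image is still matched to its correct description, ahead of the hard negative.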
Encoding Queries and Documents
For retrieval tasks, encode_query() and encode_document() are the recommended methods. Many retrieval models prepend different instruction prompts depending on whether the input is a query or a document, similar to how chat models might apply different system prompts depending on the goal. Model authors can specify their prompts in the model config, and encode_query() / encode_document() automatically load and apply the correct one:
- encode_query() uses the model’s "query" prompt (if available) and sets task="query".
- encode_document() uses the first available prompt from "document", "passage", or "corpus", and sets task="document".
Under the hood, both are thin wrappers around encode(); they just handle prompt selection for you. Here’s what cross-modal retrieval looks like:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")
# Encode text queries with the query prompt
query_embeddings = model.encode_query([
"Find me a photo of a vehicle parked near a building",
"Show me an image of a pollinating insect",
])
# Encode document screenshots with the document prompt
doc_embeddings = model.encode_document([
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
])
# Compute similarities
similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
# tensor([[0.3907, 0.1490],
# [0.1235, 0.4872]])
These methods accept the same input types as encode() (images, URLs, multimodal dicts, etc.) and pass through the same parameters. For models without specialized query/document prompts, they behave identically to encode().
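The prompt lookup order described above can be sketched in a few lines. This is an illustration only, not the library's actual implementation: select_prompt is a hypothetical helper, and the prompt texts are invented for the example.

```python
def select_prompt(prompts, candidates):
    """Return the first prompt whose name appears in `candidates`, or None if none is configured."""
    for name in candidates:
        if name in prompts:
            return prompts[name]
    return None

# Hypothetical prompt config; real models define their own prompt names and texts
prompts = {
    "query": "Retrieve images or text relevant to the query: ",
    "passage": "Represent this document: ",
}

query_prompt = select_prompt(prompts, ["query"])
document_prompt = select_prompt(prompts, ["document", "passage", "corpus"])
print(repr(query_prompt))
print(repr(document_prompt))
```

Here the document side falls through to "passage" because no "document" prompt is configured. With an empty prompt config both lookups return None, which corresponds to the case where the two methods behave like encode().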
Multimodal Reranker Models
Multimodal reranker (CrossEncoder) models score the relevance between pairs of inputs, where each element can be text, an image, audio, video, or a combination. They tend to outperform embedding models in terms of quality, but are slower since they process each pair individually. The currently available pretrained multimodal rerankers focus on text and image inputs, but the architecture supports any modality that the underlying model can handle.
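The speed difference comes down to how many model passes each approach needs. The sketch below counts them under simplifying assumptions: every pass is treated as equally expensive, batching is ignored, and the workload sizes are made up for illustration.

```python
def cross_encoder_only(num_queries, num_documents):
    """Every query is paired with every document: one model pass per pair."""
    return num_queries * num_documents

def retrieve_then_rerank(num_queries, num_documents, top_k):
    """Embed each document once, embed each query once, rerank only top_k pairs per query."""
    embedding_passes = num_documents + num_queries
    reranker_passes = num_queries * top_k
    return embedding_passes + reranker_passes

# Hypothetical workload sizes
num_queries, num_documents, top_k = 100, 1_000_000, 10
print(cross_encoder_only(num_queries, num_documents))           # 100000000
print(retrieve_then_rerank(num_queries, num_documents, top_k))  # 1001100
```

Because document embeddings are computed once and reused for every query, the two-stage approach grows roughly with the corpus size, while reranking everything grows with queries times documents.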
Ranking Mixed-Modality Documents
The rank() method scores and ranks a list of documents against a query, supporting mixed modalities:
from sentence_transformers import CrossEncoder
model = CrossEncoder("Qwen/Qwen3-VL-Reranker-2B")
query = "A green car parked in front of a yellow building"
documents = [
# Image documents (URL or local file path)
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
# Text document
"A vintage Volkswagen Beetle painted in bright green sits in a driveway.",
# Combined text + image document
{
"text": "A car in a European city",
"image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
},
]
rankings = model.rank(query, documents)
for rank in rankings:
    print(f"{rank['score']:.4f}\t(document {rank['corpus_id']})")
"""
0.9375 (document 0)
0.5000 (document 3)
-1.2500 (document 2)
-2.4375 (document 1)
"""
The reranker correctly identifies the car image (document 0) as the most relevant result, followed by the combined text+image document about a car in a European city (document 3). The bee image (document 1) scores lowest. Keep in mind that the modality gap can influence absolute scores: text-image pair scores may occupy a different range than text-text or image-image pair scores.
You can also check which modalities a reranker supports using modalities and supports(), just like with embedding models:
print(model.modalities)
# ['text', 'image', 'video', 'message']
print(model.supports("image"))
# True
# Check if the model supports a specific pair of modalities
print(model.supports(("image", "text")))
# True
Predicting Pair Scores
You can also use predict() to get raw relevance scores for specific pairs of inputs:
from sentence_transformers import CrossEncoder
model = CrossEncoder("jinaai/jina-reranker-m0", trust_remote_code=True)
scores = model.predict([
("A green car", "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"),
("A bee on a flower", "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"),
("A green car", "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"),
])
print(scores)
# [0.9389156 0.96922314 0.46063158]
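Since predict() returns one score per input pair, in the same order as the pairs, it is easy to post-process the result yourself. The sketch below groups the scores printed above by query and picks the best image for each; it is plain Python on the copied numbers, with the image URLs shortened to file names.

```python
from collections import defaultdict

# Pairs and scores as in the example above (image URLs shortened to file names)
pairs = [
    ("A green car", "car.jpg"),
    ("A bee on a flower", "bee.jpg"),
    ("A green car", "bee.jpg"),
]
scores = [0.9389156, 0.96922314, 0.46063158]

# Group (score, image) candidates under the query they belong to
scores_by_query = defaultdict(list)
for (query, image), score in zip(pairs, scores):
    scores_by_query[query].append((score, image))

# max() compares the score first, so it returns the highest-scoring candidate
best_match = {query: max(candidates)[1] for query, candidates in scores_by_query.items()}
print(best_match)
```

For "A green car", the car image (0.94) wins over the bee image (0.46), which matches what you would expect from the raw scores.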
Retrieve and Rerank
A common pattern is to use an embedding model for fast initial retrieval, then refine the top results with a reranker:
from sentence_transformers import SentenceTransformer, CrossEncoder
# Step 1: Retrieve with an embedding model
embedder = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")
query = "revenue growth chart"
query_embedding = embedder.encode_query(query)
# Pre-compute corpus embeddings (do this once, then store them)
document_screenshots = [
"path/to/doc1.png",
"path/to/doc2.png",
# ... potentially millions of document screenshots
]
corpus_embeddings = embedder.encode_document(document_screenshots, show_progress_bar=True)
# Simple cosine similarity retrieval, viable as long as embeddings fit in memory
similarities = embedder.similarity(query_embedding, corpus_embeddings)
top_k_indices = similarities.argsort(descending=True)[0][:10]
# Step 2: Rerank the top-k results with a reranker model
reranker = CrossEncoder("nvidia/llama-nemotron-rerank-vl-1b-v2", trust_remote_code=True)
top_k_documents = [document_screenshots[i] for i in top_k_indices]
rankings = reranker.rank(query, top_k_documents)
for rank in rankings:
    print(f"{rank['score']:.4f}\t{top_k_documents[rank['corpus_id']]}")
Since the corpus embeddings are pre-computed, the initial retrieval is fast even over millions of documents. The reranker then provides more accurate scoring over the smaller candidate set.
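Whether the in-memory approach is viable depends on the size of the embedding matrix. The sketch below estimates it, assuming 2048-dimensional embeddings (the shape shown earlier for Qwen3-VL-Embedding-2B) stored as float32; both the corpus size and the storage precision are assumptions you should adjust to your setup.

```python
def embedding_memory_gb(num_documents, dimensions, bytes_per_value=4):
    """Size of a dense embedding matrix in decimal gigabytes (float32 by default)."""
    return num_documents * dimensions * bytes_per_value / 1e9

# One million documents at 2048 dimensions
print(embedding_memory_gb(1_000_000, 2048))     # float32: 8.192
print(embedding_memory_gb(1_000_000, 2048, 2))  # float16: 4.096
```

At this scale the matrix still fits on a single well-equipped machine; for much larger corpora, a vector index or quantized embeddings become more attractive.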
Input Formats and Configuration
Supported Input Types
Multimodal models accept a variety of input formats. Here’s a summary of what you can pass to model.encode():
| Modality | Accepted Formats |
| --- | --- |
| Text | Strings |
| Image | PIL.Image.Image objects; file paths (e.g. "./photo.jpg"); URLs (e.g. "https://.../image.jpg"); Numpy arrays, torch tensors |
| Audio | File paths (e.g. "./audio.wav"); URLs (e.g. "https://.../audio.wav"); Numpy/torch arrays; dicts with "array" and "sampling_rate" keys; torchcodec.AudioDecoder instances |
| Video | File paths (e.g. "./video.mp4"); URLs (e.g. "https://.../video.mp4"); Numpy/torch arrays; dicts with "array" and "video_metadata" keys; torchcodec.VideoDecoder instances |
| Multimodal | Dicts mapping modality names to values, e.g. {"text": "a caption", "image": "https://.../image.jpg"}. Valid keys: "text", "image", "audio", "video" |
| Message | List of message dicts with "role" and "content" keys, e.g. [{"role": "user", "content": [...]}] |
Checking Modality Support
You can check which modalities a model supports using the modalities property and supports() method:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")
# List all supported modalities
print(model.modalities)
# ['text', 'image', 'video', 'message']
# Check for a specific modality
print(model.supports("image"))
# True
print(model.supports("audio"))
# False
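One practical use of this information is to fail early when a batch contains a modality the model cannot handle. The sketch below shows the idea with a hypothetical helper, missing_modalities, and a hard-coded list standing in for model.modalities (copied from the output above) so that it runs without loading a model.

```python
def missing_modalities(required, supported):
    """Return the required modalities that are absent from `supported`, in order."""
    return [m for m in required if m not in supported]

# Stand-in for model.modalities, copied from the output above
supported = ["text", "image", "video", "message"]

print(missing_modalities(["text", "image"], supported))   # nothing missing
print(missing_modalities(["image", "audio"], supported))  # audio is missing
```

With a real model you would pass model.modalities as the second argument, or call model.supports() per modality.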
The "message" modality indicates that the model accepts chat-style message inputs with interleaved content. In practice, you rarely need to use this directly. When you pass strings, URLs, or multimodal dicts, the model converts them to the appropriate message format internally. Sentence Transformers supports two message formats:
- Structured (most VLMs, e.g. Qwen3-VL): Content is a list of typed dicts, e.g. [{"type": "text", "text": "..."}, {"type": "image", "image": ...}]
- Flat (e.g. Deepseek-V3): Content is a direct value, e.g. "some text"
The format is auto-detected from the model’s chat template.
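To give a feel for what the two formats look like, here is a small sketch that builds each one by hand. The helper names, the "user" role, and the ordering of content parts are assumptions made for illustration; the library's internal conversion may differ.

```python
def to_structured_message(inputs, role="user"):
    """Turn a modality dict such as {"text": ..., "image": ...} into one chat message
    whose content is a list of typed dicts (the "structured" format)."""
    content = [{"type": modality, modality: value} for modality, value in inputs.items()]
    return [{"role": role, "content": content}]

def to_flat_message(text, role="user"):
    """Build one chat message whose content is the value itself (the "flat" format)."""
    return [{"role": role, "content": text}]

structured = to_structured_message({"image": "https://.../image.jpg", "text": "a caption"})
flat = to_flat_message("some text")
print(structured)
print(flat)
```

The structured form can interleave several typed parts in one message, while the flat form carries a single value.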
Since all inputs get converted into the same message format internally, you can mix input types in a single encode() call:
embeddings = model.encode([
# A text input
"A green car parked in front of a yellow building",
# An image input (URL)
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
# A combined text + image input
{
"text": "A car in a European city",
"image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
},
])
If a model doesn’t follow either format and you need full control, you can pass raw message dicts with role and content keys directly:
embeddings = model.encode([
[
{
"role": "user",
"content": [
{"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"},
{"type": "text", "text": "Describe this vehicle."},
],
}
],
])
This bypasses the automatic format conversion and passes the messages directly to the processor’s apply_chat_template().
Processor and Model kwargs
You may want to control image resolution bounds or model precision. Use processor_kwargs and model_kwargs when loading the model:
model = SentenceTransformer(
"Qwen/Qwen3-VL-Embedding-2B",
model_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": "bfloat16"},
processor_kwargs={"min_pixels": 28 * 28, "max_pixels": 600 * 600},
)
- processor_kwargs controls how inputs are preprocessed (e.g., image resolution bounds). Higher max_pixels means higher quality but more memory and compute. These are passed directly to AutoProcessor.from_pretrained(...).
- model_kwargs controls how the underlying model is loaded (e.g., precision, attention implementation). These are passed directly to the appropriate AutoModel.from_pretrained(...) call (e.g., AutoModel, AutoModelForCausalLM, AutoModelForSequenceClassification, etc., depending on the configuration of the model modules).
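To get an intuition for what a pixel budget means in practice, the sketch below scales an image down, keeping its aspect ratio, until it roughly fits under max_pixels. This is a simplified illustration and not the processor's actual resizing algorithm, which may also round sizes to patch multiples and enforce min_pixels; fit_to_pixel_budget is a hypothetical helper.

```python
import math

def fit_to_pixel_budget(width, height, max_pixels):
    """Scale (width, height) down, keeping the aspect ratio, so that the pixel count
    roughly fits within max_pixels. Images already within the budget are unchanged."""
    if width * height <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / (width * height))
    return round(width * scale), round(height * scale)

max_pixels = 600 * 600  # 360,000 pixels, as in the example above
print(fit_to_pixel_budget(1920, 1080, max_pixels))  # a Full HD screenshot is shrunk
print(fit_to_pixel_budget(400, 300, max_pixels))    # a small image is left as-is
```

A Full HD screenshot loses most of its pixels under this budget, so small text in document screenshots may become hard to read; raising max_pixels trades memory and compute for that detail.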
See the SentenceTransformer API Reference documentation for more details on these kwargs.
In Sentence Transformers v5.4, tokenizer_kwargs has been renamed to processor_kwargs to reflect that multimodal models use processors rather than just tokenizers. The old name is still accepted but deprecated.
Supported Models
Here are the multimodal models supported in v5.4, also available in the v5.4 integrations collection:
Supported Multimodal Embedding Models
| Model | Parameters | Modalities | Revision |
| --- | --- | --- | --- |
| Qwen/Qwen3-VL-Embedding-2B | 2B | Text, Image, Video | No revision needed |
| Qwen/Qwen3-VL-Embedding-8B | 8B | Text, Image, Video | No revision needed |
| nvidia/llama-nemotron-embed-vl-1b-v2 | 1.7B | Text, Image | No revision needed |
| nvidia/omni-embed-nemotron-3b | 4.7B | Text, Image | No revision needed |
| LCO-Embedding/LCO-Embedding-Omni-3B | 5B | Text, Image, Audio, Video | No revision needed |
| LCO-Embedding/LCO-Embedding-Omni-7B | 9B | Text, Image, Audio, Video | No revision needed |
| BidirLM/BidirLM-Omni-2.5B-Embedding | 2.5B | Text, Image, Audio | No revision needed |
| BAAI/BGE-VL-base | 0.1B | Text, Image | No revision needed |
| BAAI/BGE-VL-large | 0.4B | Text, Image | No revision needed |
| BAAI/BGE-VL-MLLM-S1 | 8B | Text, Image | No revision needed |
| BAAI/BGE-VL-MLLM-S2 | 8B | Text, Image | No revision needed |
| BAAI/BGE-VL-v1.5-zs | 8B | Text, Image | No revision needed |
| BAAI/BGE-VL-v1.5-mmeb | 8B | Text, Image | No revision needed |
| BAAI/BGE-VL-Screenshot | 4B | Text, Image | No revision needed |
| royokong/e5-v | 8B | Text, Image | No revision needed |
| eagerworks/eager-embed-v1 | 4B | Text, Image | revision="refs/pr/2" |
| nomic-ai/nomic-embed-multimodal-3b | 5B | Text, Image | revision="refs/pr/4" |
| nomic-ai/nomic-embed-multimodal-7b | 9B | Text, Image | revision="refs/pr/3" |
| Haon-Chen/e5-omni-3B | 5B | Text, Image, Audio, Video | revision="refs/pr/2" |
| Haon-Chen/e5-omni-7B | 9B | Text, Image, Audio, Video | revision="refs/pr/1" |
Supported Multimodal Reranker Models
Text-Only Reranker Models (also new in v5.4)
A text-only reranker usage example:
from sentence_transformers import CrossEncoder
model = CrossEncoder("mixedbread-ai/mxbai-rerank-base-v2")
query = "How do I bake sourdough bread?"
documents = [
    "Sourdough bread requires a starter made from flour and water, fermented over several days.",
    "The history of bread dates back to ancient Egypt around 8000 BCE.",
    "To bake sourdough, mix your starter with flour, water, and salt, then let it rise overnight.",
    "Rye bread is a popular alternative to wheat-based breads in Northern Europe.",
]
# Score each (query, document) pair directly
pairs = [(query, doc) for doc in documents]
scores = model.predict(pairs)
print(scores)
# [ 7.3077507 -2.6217823  8.724761  -2.2488995]
# Or let rank() score and sort the documents in one call
rankings = model.rank(query, documents)
for rank in rankings:
    print(f"{rank['score']:.4f}\t{documents[rank['corpus_id']]}")
# 8.7248 To bake sourdough, mix your starter with flour, water, and salt, then let it rise overnight.
# 7.3078 Sourdough bread requires a starter made from flour and water, fermented over several days.
# -2.2489 Rye bread is a popular alternative to wheat-based breads in Northern Europe.
# -2.6218 The history of bread dates back to ancient Egypt around 8000 BCE.
CLIP Models
The older CLIP models continue to be supported. These simple CLIP models still work well on lower-resource hardware.
A CLIP usage example:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/clip-ViT-L-14")
images = [
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
"https://huggingface.co/datasets/huggingface/cats-image/resolve/main/cats_image.jpeg"
]
texts = ["A green car", "A bee on a flower", "Some cats on a couch", "One cat sitting in the window"]
image_embeddings = model.encode(images)
text_embeddings = model.encode(texts)
print(image_embeddings.shape, text_embeddings.shape)
# (3, 768) (4, 768)
similarities = model.similarity(image_embeddings, text_embeddings)
print(similarities)
# tensor([[0.2208, 0.1042, 0.0617, 0.0907], First image (car) is most similar to "A green car"
# [0.1205, 0.2303, 0.0632, 0.0917], Second image (bee) is most similar to "A bee on a flower"
# [0.1107, 0.0196, 0.2425, 0.1162]]) Third image (multiple cats) is most similar to "Some cats on a couch"
Additional Resources
Documentation
- Sentence Transformer > Usage
- Sentence Transformer > Pretrained Models
- Cross Encoder > Usage
- Cross Encoder > Pretrained Models
- Installation
Training
To learn how to finetune these multimodal models on your own data, see the companion blogpost: Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers.
- Sentence Transformer > Training Overview
- Sentence Transformer > Training Examples
- Cross Encoder > Training Overview
- Cross Encoder > Training Examples
- Sparse Encoder > Training Overview
- Sparse Encoder > Training Examples
Hugging Face Hub
- Sentence Transformers models on the Hub
- Sentence Transformers datasets on the Hub
- v5.4 Integrations Collection
Companion Blogposts
The training companion to this post and adjacent Sentence Transformers guides:
- Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers: the direct training companion to this post, with a Visual Document Retrieval walkthrough.
- Training and Finetuning Embedding Models with Sentence Transformers: the general training guide for text-only bi-encoder embedding models.
- Training and Finetuning Reranker Models with Sentence Transformers: Cross Encoder (reranker) training, applicable to text-only and multimodal rerankers.
- Training and Finetuning Sparse Embedding Models with Sentence Transformers: SPLADE training for sparse retrieval.
- 🪆 Introduction to Matryoshka Embedding Models: variable-size embeddings; also applied to multimodal models in the training companion post.
- Train 400x faster Static Embedding Models with Sentence Transformers: CPU-friendly text embedding models.
- Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval: post-training compression of embedding vectors.