Granite Embedding Multilingual R2: Open Apache 2.0 Multilingual Embeddings with 32K Context — Best Sub-100M Retrieval Quality
Summary
IBM releases Granite Embedding Multilingual R2, a family of open-source multilingual embedding models under Apache 2.0, featuring a compact 97M model that achieves best-in-class sub-100M retrieval quality and a 311M model with Matryoshka embeddings, both supporting 32K context and 200+ languages.
Cached at: 05/15/26, 12:17 AM
Source: https://huggingface.co/blog/ibm-granite/granite-embedding-multilingual-r2
**TL;DR:** Two new Apache 2.0 multilingual embedding models built on ModernBERT: a 97M-parameter compact model that beats every open sub-100M multilingual embedder on MTEB Multilingual Retrieval (60.3), and a 311M full-size model that scores 65.2 on MTEB Multilingual Retrieval (#2 among open models under 500M parameters) with Matryoshka support. Both cover 200+ languages, are tuned on 52 languages, handle 32K-token context (64x R1), and add code retrieval across 9 programming languages.
In this post: Enterprise-Ready by Design · A Strong Sub-100M Multilingual Model · What Changed from R1 · Training the Full-Size 311M Model · Building the Compact 97M Multilingual Model · Benchmark Results · Matryoshka Embeddings · Deployment Options · For Framework Integrators · Which Model Should You Use? · Try The Models
Multilingual embedding models face a persistent tension: broad language coverage usually comes at the cost of model size, and small models usually sacrifice languages. If you work across languages — retrieval-augmented generation over multilingual corpora, cross-lingual search, code retrieval in international teams — you’ve likely had to choose between a model that’s fast enough and one that’s good enough.
The Granite Embedding Multilingual R2 release narrows that gap considerably. We’re releasing two new multilingual embedding models:
- granite-embedding-311m-multilingual-r2: A 311M-parameter full-size model with 768-dimensional embeddings, Matryoshka dimension support, and top-tier multilingual retrieval quality.
- granite-embedding-97m-multilingual-r2: A 97M-parameter compact model with 384-dimensional embeddings that delivers strong retrieval quality for its size.
Both models support 200+ languages with enhanced retrieval quality for 52 languages and programming code, handle context lengths up to 32,768 tokens (a 64x increase over their R1 predecessors), and are released under the Apache 2.0 license. They work out of the box with `sentence-transformers` and `transformers`, require no task-specific instructions, and are compatible as drop-in replacements in LangChain, LlamaIndex, Haystack, and Milvus with a one-line model name change. For frameworks currently using an English-only default, that one line gives every user in your community support for 200+ languages, with no API changes, no new dependencies, and no code changes required on their end. Both models ship with ONNX and OpenVINO weights for CPU-optimized inference.
52 enhanced-support languages
The underlying encoder was pretrained on text from 200+ languages, producing general-purpose embeddings for any of them. The following 52 languages receive explicit retrieval-pair and cross-lingual training for higher-quality retrieval:
Albanian (sq), Arabic (ar), Azerbaijani (az), Bengali (bn), Bulgarian (bg), Catalan (ca), Chinese (zh), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Finnish (fi), French (fr), Georgian (ka), German (de), Greek (el), Hebrew (he), Hindi (hi), Hungarian (hu), Icelandic (is), Indonesian (id), Italian (it), Japanese (ja), Kazakh (kk), Khmer (km), Korean (ko), Latvian (lv), Lithuanian (lt), Malay (ms), Marathi (mr), Norwegian (no), Persian (fa), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Serbian (sr), Slovak (sk), Slovenian (sl), Spanish (es), Swahili (sw), Swedish (sv), Tagalog (tl), Telugu (te), Thai (th), Turkish (tr), Ukrainian (uk), Urdu (ur), Uzbek (uz), Vietnamese (vi).
Additionally, the models are trained on programming code (Python, Go, Java, JavaScript, PHP, Ruby, SQL, C, C++) and support cross-lingual code retrieval.
Enterprise-Ready by Design
Both embedding models are trained on a mixture of IBM‑curated datasets, publicly available data, and internally generated or synthetic data. Public web‑derived data used in training is selected and filtered using IBM‑developed quality, deduplication, and governance processes intended to reduce risk in downstream commercial use. We intentionally avoid the use of the MS‑MARCO training dataset and datasets with explicit non‑commercial licensing restrictions. The models are pretrained using GneissWeb, an IBM‑curated dataset derived from publicly available web content and processed using IBM’s data preparation and governance tooling, along with additional IBM‑curated and other publicly available sources. Datasets undergo IBM governance review to assess licensing considerations, ownership signals, and personal data risks. These processes are designed to contribute to responsible use and enterprise deployment.
A Strong Sub-100M Multilingual Model
The standout of this release is granite-embedding-97m-multilingual-r2. At 97 million parameters, it scores 60.3 on Multilingual MTEB Retrieval across 18 languages, the highest retrieval score we’ve found for any open multilingual embedding model under 100M parameters. The next-best model in that size class, multilingual-e5-small, scores 50.9 on the same benchmark, a **+9.4 point gap** on a mature benchmark.
At roughly one-third the size of the 311M full-size model, it retains the majority of its retrieval quality across multilingual, code, and long-document benchmarks. It also posts a **+12.2 point gain** on MTEB Multilingual Retrieval over its direct predecessor, driven by a new architecture, better training data, and a novel pruning methodology (more on that below). The full-size granite-embedding-311m-multilingual-r2 scores 65.2 on the same benchmark, a **+13.0 point gain** over its R1 predecessor.
What Changed from R1
The Granite Embedding Multilingual R1 models were built on XLM-RoBERTa encoders with a 512-token context window. The R2 generation is a ground-up rebuild:
ModernBERT is a recent encoder architecture that revisits the original BERT design with techniques from the last five years of transformer research. The shift brings several practical benefits: alternating attention lengths reduce computation and significantly improve throughput on long sequences, rotary position embeddings allow the 32K context window without the positional interpolation hacks that plague older architectures, and Flash Attention 2.0 support speeds up encoding on modern GPUs.
The new multilingual tokenizers are worth highlighting. Rather than reusing XLM-RoBERTa’s 250K-token vocabulary, we adopted existing tokenizers with strong multilingual and code coverage. The 311M model uses the Gemma 3 tokenizer (262K tokens); the 97M model starts from the GPT-OSS tokenizer and prunes it down to a compact 180K-token vocabulary that preserves broad multilingual coverage while reducing the embedding table’s parameter footprint. Tokenizer efficiency matters more than people realize — a 32K-token window sounds impressive until your tokenizer burns half of it encoding a single paragraph of Thai.
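The tokenizer-efficiency point can be made concrete with a little arithmetic. The sketch below estimates how much of a document fits in a context window for a given tokens-per-character rate. The two rates and the document length are hypothetical placeholders chosen for illustration, not measurements of any tokenizer named in this post.

```python
def chars_that_fit(context_tokens: int, tokens_per_char: float) -> int:
    """Approximate number of characters that fit in a context window."""
    if tokens_per_char <= 0:
        raise ValueError("tokens_per_char must be positive")
    return int(context_tokens / tokens_per_char)


def coverage(doc_chars: int, context_tokens: int, tokens_per_char: float) -> float:
    """Fraction of a document (by characters) visible before truncation."""
    if doc_chars <= 0:
        return 1.0
    return min(1.0, chars_that_fit(context_tokens, tokens_per_char) / doc_chars)


# Hypothetical rates, for illustration only.
EFFICIENT = 0.25    # about 4 characters per token
INEFFICIENT = 1.0   # about 1 character per token

doc_chars = 60_000  # hypothetical long document
for window in (512, 32_768):
    for name, rate in (("efficient", EFFICIENT), ("inefficient", INEFFICIENT)):
        share = coverage(doc_chars, window, rate)
        print(f"window={window:>6} tokenizer={name:<11} coverage={share:.1%}")
```

To use it with real numbers, replace the placeholder rates with token counts divided by character counts measured on your own corpus.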
Training the Full-Size 311M Model
The 311M model is a 22-layer ModernBERT encoder with a 262K-token multilingual vocabulary, trained through a multi-stage pipeline:
- Knowledge distillation: The model learns from multiple teacher models simultaneously. The teachers are Granite 3.3 Instruct and Mistral v0.2 Instruct decoder models, further finetuned for text embeddings, which transfer retrieval-specific knowledge into the 311M encoder architecture.
- Contrastive fine-tuning: Standard contrastive training on multilingual retrieval pairs — queries matched with relevant and hard-negative passages across 52 languages and code — sharpens the model’s ability to distinguish relevant from irrelevant results.
- Model merging: After training, we merge checkpoints from different training stages and configurations. This combines the strengths of models optimized for different objectives (e.g., multilingual breadth vs. English depth) into a single set of weights without additional training compute.
- Matryoshka Representation Learning: The model is trained with Matryoshka objectives so that its 768-dimensional embeddings can be truncated to 512, 384, 256, or 128 dimensions with minimal quality loss (see Matryoshka Embeddings below).
The result is a model that scores 65.2 on MTEB Multilingual Retrieval and 56.3 on the overall average — a +14.5 point average gain over its R1 predecessor.
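The post does not say which merging method is used. One common approach is a weighted average of corresponding parameters across checkpoints, and the sketch below assumes that. Plain Python lists stand in for tensors, and the checkpoint names and coefficients are hypothetical.

```python
from typing import Dict, List, Sequence

Weights = Dict[str, List[float]]


def merge_checkpoints(checkpoints: Sequence[Weights], coeffs: Sequence[float]) -> Weights:
    """Linearly interpolate parameters that share a name across checkpoints."""
    if not checkpoints or len(checkpoints) != len(coeffs):
        raise ValueError("need one coefficient per checkpoint")
    total = sum(coeffs)
    if total == 0:
        raise ValueError("coefficients must not sum to zero")
    norm = [c / total for c in coeffs]  # normalize so the weights sum to 1

    keys = checkpoints[0].keys()
    for ckpt in checkpoints:
        if ckpt.keys() != keys:
            raise ValueError("checkpoints must share the same parameter names")

    merged: Weights = {}
    for name in keys:
        length = len(checkpoints[0][name])
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(norm, checkpoints))
            for i in range(length)
        ]
    return merged


# Toy checkpoints (hypothetical): one tuned for multilingual breadth, one for English depth.
multilingual = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
english = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

merged = merge_checkpoints([multilingual, english], [0.5, 0.5])
print(merged)
```

Because merging only combines existing weights, it adds no training compute, which matches the motivation given in the list above.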
Building the Compact 97M Multilingual Model
The 97M model is trained through a combination of vocabulary selection and knowledge distillation:
- Vocabulary selection: The 262K-token vocabulary is reduced to a purpose-trained 180K-token vocabulary that preserves broad multilingual coverage while cutting the embedding table size substantially.
- Knowledge distillation: The pruned model is then finetuned using knowledge distillation from multiple teacher models (including a Granite 4.1 8B and Mistral Instruct decoder-based teacher) and contrastive training to improve retrieval quality.
This approach transfers retrieval-specific knowledge from multiple strong teachers, while reducing the model parameters without sacrificing language coverage. The result is a highly efficient compact model — scoring 60.3 on MTEB Multilingual Retrieval vs. 65.2 for the full-size model, while being approximately 3x smaller.
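A rough calculation shows why vocabulary reduction matters for a small model, since the embedding table holds one row per token. The sketch below treats "262K" and "180K" as round numbers and assumes the hidden size equals the output embedding dimension (384 for the 97M model, 768 for the 311M model), which the post does not state. Under those assumptions the table sizes are consistent with reading the Active Params column in the benchmark table as non-embedding parameters: 97M minus 28M is about 69M, and 311M minus 110M is about 201M. The `prune_embedding_table` helper is a hypothetical illustration of row selection, not the authors' pruning method.

```python
from typing import Dict, Iterable, List, Tuple


def embedding_table_params(vocab_size: int, hidden_size: int) -> int:
    """Number of parameters in a vocab_size x hidden_size embedding table."""
    return vocab_size * hidden_size


def prune_embedding_table(
    table: List[List[float]], keep_ids: Iterable[int]
) -> Tuple[List[List[float]], Dict[int, int]]:
    """Keep only the rows for keep_ids; return the new table and an old-to-new id map."""
    ordered = sorted(set(keep_ids))
    new_table = [table[i] for i in ordered]
    id_map = {old: new for new, old in enumerate(ordered)}
    return new_table, id_map


# Assumed hidden sizes (not stated in the post).
before = embedding_table_params(262_000, 384)
after = embedding_table_params(180_000, 384)
print(f"97M model table: {after / 1e6:.1f}M params, saving {(before - after) / 1e6:.1f}M")
print(f"311M model table: {embedding_table_params(262_000, 768) / 1e6:.1f}M params")

# Toy 4-token table with 2-dimensional rows.
toy_table = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
new_table, id_map = prune_embedding_table(toy_table, keep_ids=[3, 0])
print(new_table, id_map)
```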
Benchmark Results
Multilingual Retrieval
Performance across the main benchmark suite sorted by model size. Scores are averages across tasks within each benchmark (higher is better):
| Model | Params | Active Params | Embed Dim | MTEB Multilingual Retrieval (18) | Code (12) | English Retrieval (10) | LongEmbed (6) | RaR-b (17) |
|---|---|---|---|---|---|---|---|---|
| F2LLM-v2-80M | 80M | 32M | 320 | 50.1 | 68.0 | 47.5 | 31.7 | 17.9 |
| multilingual-e5-small | 118M | 22M | 384 | 50.9 | 53.5 | 46.5 | 38.8 | 20.3 |
| granite-embedding-107m-multilingual (R1) | 107M | 11M | 384 | 48.1 | 40.7 | 47.9 | 34.3 | 17.1 |
| paraphrase-multilingual-MiniLM-L12-v2 | 118M | 22M | 384 | 36.6 | 23.5 | 35.9 | 20.9 | 10.9 |
| jina-embeddings-v5-text-nano | 212M | 113M | 768 | 63.3 | 71.2 | 58.8 | 63.6 | 25.2 |
| harrier-oss-v1-270m | 268M | 100M | 640 | 66.4 | 62.4 | 52.1 | 64.9 | 32.9 |
| multilingual-e5-base | 278M | 86M | 768 | 52.7 | 52.6 | 49.0 | 40.5 | 23.4 |
| granite-embedding-278m-multilingual (R1) | 278M | 86M | 768 | 52.2 | 48.5 | 51.5 | 37.7 | 18.9 |
| embeddinggemma-300m | 308M | 106M | 768 | 62.5 | 68.7 | 54.6 | 55.4 | 26.1 |
| gte-multilingual-base | 305M | 113M | 768 | 57.2 | 57.5 | 50.8 | 62.1 | 19.0 |
| snowflake-arctic-embed-m-v2.0 | 305M | 113M | 768 | 54.8 | 55.2 | 58.4 | 55.4 | 23.3 |
| multilingual-e5-large | 560M | 304M | 1024 | 53.7 | 55.8 | 51.5 | 40.4 | 25.4 |
| text-embedding-3-small (OpenAI, API only) | n/a | n/a | 1536 | 50.7 | n/a | 53.8 | 53.6 | 23.2 |
| **granite-embedding-97m-multilingual-r2** | 97M | 28M | 384 | 60.3 | 60.4 | 50.1 | 65.6 | 24.9 |
| **granite-embedding-311m-multilingual-r2** | 311M | 110M | 768 | **65.2 (#2)** | **63.8 (#3)** | **52.6 (#5)** | **71.7 (#1)** | **28.0 (#2)** |

A few things stand out:
- The 97M R2 model beats multilingual-e5-base and gte-multilingual-base (~300M parameter models) on average and on most individual benchmarks, despite being roughly 3x smaller.
- `paraphrase-multilingual-MiniLM-L12-v2`, a widely-used framework default, scores 36.6, a full **23.7 points** behind the 97M R2 model, which is also slightly smaller (97M vs 110M parameters) with the same 384-dimensional output.
- LongEmbed is the biggest R1-to-R2 gain: +31.3 points for the 97M model, +34.0 for the 311M. This is the direct payoff of the 32K context window: R1’s 512-token limit meant your legal contract was being judged by its first page. Many practical multilingual workloads involve long documents (legal contracts, technical manuals, research papers, multi-page reports) that R1 simply could not see in full.
- Code retrieval improves dramatically: +19.7 (97M) and +15.3 (311M) over R1, reflecting the new code training set, larger context window, and better training methodology.
- In the broader competitive field, harrier-oss-v1-270m leads on MTEB Multilingual Retrieval (66.4) and RaR-b (32.9), while jina-embeddings-v5-text-nano leads on Code (71.2) and English Retrieval (58.8). The 311M Granite model is competitive on average (56.3) and leads on LongEmbed (71.7), while offering substantially higher encoding throughput than jina-embeddings-v5-text-nano (see speed table below).
Speed and Throughput
Encoding speed matters for production workloads, especially when you’re indexing millions of documents or need low-latency query encoding. We measured latency and throughput on a single NVIDIA H100 GPU using 512-token chunks:
The 97M model encodes over 2,500 documents per second, comparable throughput to multilingual-e5-small, while delivering substantially higher retrieval quality. The 311M model, at ~1,800 docs/sec, performs better than jina-embeddings-v5-text-nano on retrieval quality (65.2 vs. 63.3) at over 5.5x the encoding speed. (Note: speed numbers were measured with the latest transformers code, which has a speed regression relative to version 4.57 that affects both the Jina and Granite models; see our technical report for details.) harrier-oss-v1-270m offers the best combination of speed and retrieval score among the competitors listed here.
Matryoshka Embeddings (311M)
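To translate these throughput figures into indexing time, a back-of-the-envelope estimate is enough. The sketch below assumes the quoted rates (512-token chunks on a single H100) hold steadily and ignores I/O, batching overhead, and vector-store writes. The corpus size is a hypothetical example.

```python
def indexing_hours(num_chunks: int, chunks_per_second: float) -> float:
    """Wall-clock hours to encode num_chunks at a steady encoding rate."""
    if chunks_per_second <= 0:
        raise ValueError("chunks_per_second must be positive")
    return num_chunks / chunks_per_second / 3600


corpus_chunks = 10_000_000  # hypothetical corpus of 512-token chunks

# Rates taken from the text above: "over 2,500" and "~1,800" docs/sec.
for name, rate in (("97M", 2500.0), ("311M", 1800.0)):
    print(f"{name}: about {indexing_hours(corpus_chunks, rate):.2f} hours")
```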
The 311M model supports Matryoshka Representation Learning, which lets you truncate embeddings from the full 768 dimensions down to 512, 384, 256, or 128 with graceful quality degradation. This is useful when storage, memory, or similarity-computation cost is a concern: a 256-dimensional embedding takes one-third the storage of a 768-dimensional one, and cosine similarity is proportionally cheaper to compute.
Here’s how retrieval quality holds up across embedding dimensions:
The quality loss from dimension reduction is remarkably small. Cutting from 768 to 256 dimensions, a 3x reduction in storage and similarity-computation cost, drops MTEB Multilingual Retrieval by just 0.5 points (65.2 → 64.7) and Code Retrieval by 0.5 points (63.9 → 63.4). Even at 128 dimensions (a 6x reduction), the model still scores 63.7 on MTEB Multilingual Retrieval and 62.3 on Code, retaining over **97%** of its full-dimension performance. In practice, this means you can substantially reduce your index size and search latency with minimal impact on result quality. (Note: results in the above picture were evaluated with a context length of 1024 for English and Multilingual Retrieval, and 8192 for Code.)
For comparison, the 311M model truncated to 384 dimensions (the same dimensionality as the 97M model’s native output) still outperforms the 97M model across all three benchmarks. If you need 384-dimensional embeddings and can afford the 311M model’s encoding cost, Matryoshka truncation is the stronger option.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ibm-granite/granite-embedding-311m-multilingual-r2")

# Full 768-dimensional embeddings
full = model.encode(["example text"])
print(full.shape)  # (1, 768)

# Truncated to 384 dimensions
small = model.encode(["example text"], truncate_dim=384)
print(small.shape)  # (1, 384)
```
The 97M model does not support Matryoshka — 384 dimensions is already compact.
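To make the storage claim in this section concrete, here is a minimal sketch of what truncation does to a single vector and to index size. It assumes float32 storage (4 bytes per value) and renormalizes after truncation so that a dot product still equals cosine similarity. Both are assumptions for illustration, and the vector and corpus size are toy values rather than model outputs.

```python
import math
from typing import List


def truncate_and_normalize(vec: List[float], dim: int) -> List[float]:
    """Keep the first dim values of vec and rescale them to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]


def index_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw storage for num_vectors dense vectors (float32 by default)."""
    return num_vectors * dim * bytes_per_value


num_vectors = 1_000_000  # hypothetical index size
for dim in (768, 512, 384, 256, 128):
    print(f"{dim:>3} dims: {index_bytes(num_vectors, dim) / 1e9:.2f} GB")

toy = [3.0, 4.0, 12.0, 0.5]  # toy vector, not a real embedding
print(truncate_and_normalize(toy, 2))
```

The 768-to-256 ratio of raw storage is exactly 3x, matching the reduction quoted above. Real vector stores add index overhead on top of these raw figures.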
Cross-lingual Retrieval
Average performance on cross-lingual tasks within MTEB Retrieval. Belebele measures cross-lingual passage matching across 122 languages; MLQA measures extractive cross-lingual question answering retrieval across 7 languages.

| Model | Belebele Retrieval | MLQA Retrieval |
|---|---|---|
| granite-embedding-107m-multilingual (R1) | 55.1 | 60.5 |
| granite-embedding-278m-multilingual (R1) | 62.2 | 63.0 |
| granite-embedding-97m-multilingual-r2 | 52.9 | 60.5 |
| granite-embedding-311m-multilingual-r2 | 66.5 | 67.1 |

The 311M R2 model gains +4.3 on Belebele and +4.1 on MLQA over its R1 predecessor, showing improved cross-lingual transfer at the larger scale across both benchmarks.
The 97M R2 model scores lower on Belebele (52.9 vs 55.1, −2.2) while matching its R1 predecessor on MLQA (60.5). The Belebele gap is a tradeoff inherent in the pruning and vocabulary reduction process — the R2 model’s training prioritized the broader 18-language MTEB Multilingual Retrieval set (where it gains +12.2 over R1) and long-document retrieval (+31.3), while the smaller vocabulary (180K vs. 250K tokens) and reduced layer count (12 vs. 22) affect narrow cross-lingual transfer tasks. If cross-lingual transfer across many language pairs is your primary use case, the full-size 311M model is the better choice.
Deployment Options
Both models ship with multiple deployment paths for production use. Install the core library with:
```bash
pip install sentence-transformers
```
Sentence Transformers (recommended for most users):
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("ibm-granite/granite-embedding-97m-multilingual-r2")

queries = [
    "What is the tallest mountain in Japan?",  # English
    "Wer hat das Lied Achy Breaky Heart geschrieben?",  # German
    "ドイツの首都はどこですか?",  # Japanese
]
passages = [
    "富士山は、静岡県と山梨県にまたがる活火山で、標高3776.12 mで日本最高峰の独立峰である。",  # Japanese
    "Achy Breaky Heart is a country song written by Don Von Tress.",  # English
    "Berlin ist die Hauptstadt und ein Land der Bundesrepublik Deutschland.",  # German
]

q_emb = model.encode(queries)
p_emb = model.encode(passages)
print(util.cos_sim(q_emb, p_emb))
# Each query scores highest against its matching passage, across languages
```
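For readers who want to see what the similarity step above computes, the sketch below builds a pairwise cosine-similarity matrix in plain Python and picks the best passage per query. The 3-dimensional vectors are toy values standing in for embeddings, and `cosine_matrix` and `best_match` are hypothetical helper names, not part of `sentence-transformers`.

```python
import math
from typing import List, Sequence


def cosine_matrix(
    queries: Sequence[Sequence[float]], passages: Sequence[Sequence[float]]
) -> List[List[float]]:
    """Return sims[i][j] = cosine similarity between query i and passage j."""

    def unit(v: Sequence[float]) -> List[float]:
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v] if n else list(v)

    q_unit = [unit(q) for q in queries]
    p_unit = [unit(p) for p in passages]
    return [[sum(a * b for a, b in zip(q, p)) for p in p_unit] for q in q_unit]


def best_match(sim_row: Sequence[float]) -> int:
    """Index of the highest-scoring passage for one query."""
    return max(range(len(sim_row)), key=lambda j: sim_row[j])


# Toy vectors for illustration only.
toy_queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.2]]
toy_passages = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.3]]

sims = cosine_matrix(toy_queries, toy_passages)
print([best_match(row) for row in sims])
```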
LangChain (`pip install langchain-huggingface`):
```python
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="ibm-granite/granite-embedding-97m-multilingual-r2"
)
docs = embeddings.embed_documents([
    "富士山は日本最高峰の独立峰です。",
    "Mount Fuji is Japan's highest peak.",
])
query = embeddings.embed_query("What is Japan's tallest mountain?")
# Drop-in replacement anywhere LangChain accepts an Embeddings object
```
LlamaIndex (`pip install llama-index-embeddings-huggingface`):
```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

embed_model = HuggingFaceEmbedding(
    model_name="ibm-granite/granite-embedding-97m-multilingual-r2"
)
Settings.embed_model = embed_model  # applies globally to any index or pipeline
```
Haystack (`pip install sentence-transformers haystack-ai`):
```python
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.dataclasses import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

doc_embedder = SentenceTransformersDocumentEmbedder(
    model="ibm-granite/granite-embedding-97m-multilingual-r2"
)
query_embedder = SentenceTransformersTextEmbedder(
    model="ibm-granite/granite-embedding-97m-multilingual-r2"
)
doc_embedder.warm_up()
query_embedder.warm_up()

# Embed and index documents
document_store = InMemoryDocumentStore()
result_docs = doc_embedder.run(documents=[
    Document(content="富士山は日本最高峰の独立峰です。"),
    Document(content="Mount Fuji is Japan's highest peak."),
    Document(content="Achy Breaky Heart is a country song written by Don Von Tress."),
    Document(content="Berlin ist die Hauptstadt und ein Land der Bundesrepublik Deutschland."),
])
document_store.write_documents(result_docs["documents"])

# Embed query and retrieve
result_query = query_embedder.run(text="What is Japan's tallest mountain?")
retriever = InMemoryEmbeddingRetriever(document_store=document_store)
results = retriever.run(query_embedding=result_query["embedding"], top_k=2)
for doc in results["documents"]:
    print(f"{doc.score:.3f} {doc.content}")
# 0.961 Mount Fuji is Japan's highest peak.
# 0.913 富士山は日本最高峰の独立峰です。
```
Milvus (`pip install pymilvus sentence-transformers`):
```python
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ibm-granite/granite-embedding-97m-multilingual-r2")

# Use "./milvus.db" for local persistence or a server URI for production
client = MilvusClient(":memory:")
client.create_collection(collection_name="multilingual_docs", dimension=384)

docs = [
    "富士山は日本最高峰の独立峰です。",
    "Mount Fuji is Japan's highest peak.",
    "Achy Breaky Heart is a country song written by Don Von Tress.",
    "Berlin ist die Hauptstadt und ein Land der Bundesrepublik Deutschland.",
]
embeddings = model.encode(docs).tolist()
client.insert(
    collection_name="multilingual_docs",
    data=[
        {"id": i, "vector": emb, "text": doc}
        for i, (emb, doc) in enumerate(zip(embeddings, docs))
    ],
)

query_emb = model.encode(["What is Japan's tallest mountain?"]).tolist()
results = client.search(
    collection_name="multilingual_docs",
    data=query_emb,
    limit=2,
    output_fields=["text"],
)
for hit in results[0]:
    print(f"{hit['distance']:.3f} {hit['entity']['text']}")
# 0.961 Mount Fuji is Japan's highest peak.
# 0.913 富士山は日本最高峰の独立峰です。
```
Both models also ship with pre-converted ONNX and OpenVINO weights for optimized CPU/accelerator inference, work as embedding endpoints via vLLM (`vllm serve ... --task embed`), and can be converted to GGUF for Ollama using llama.cpp. See the model cards for full deployment examples.
For Framework Integrators
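Once a model is served through vLLM, a client can request embeddings over HTTP. The sketch below assumes the server exposes vLLM's OpenAI-compatible `/v1/embeddings` route on `http://localhost:8000`; the base URL, the helper names, and the timeout are assumptions for illustration, so check the model cards for the exact serving setup. Only the standard library is used.

```python
import json
import urllib.request
from typing import List, Sequence

MODEL_ID = "ibm-granite/granite-embedding-97m-multilingual-r2"


def build_embedding_request(
    texts: Sequence[str],
    base_url: str = "http://localhost:8000",  # assumed local server address
    model: str = MODEL_ID,
) -> urllib.request.Request:
    """Build a POST request for an OpenAI-style embeddings endpoint."""
    payload = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embedding_response(body: bytes) -> List[List[float]]:
    """Extract embeddings from an OpenAI-style response, ordered by input index."""
    items = json.loads(body)["data"]
    return [item["embedding"] for item in sorted(items, key=lambda d: d["index"])]


def embed(texts: Sequence[str], base_url: str = "http://localhost:8000") -> List[List[float]]:
    """Send texts to a running server and return one embedding per text."""
    request = build_embedding_request(texts, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_embedding_response(response.read())
```

With a server running, `embed(["Mount Fuji is Japan's highest peak."])` would return a list containing one vector.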
If you maintain an embedding framework, vector store, or RAG pipeline library and are evaluating these models as a default, here’s what you need to know:
- License: Apache 2.0, trained without MS-MARCO
- Drop-in behavior: No task-specific instruction prefix required; behaves like `all-MiniLM-L6-v2` at the API level. Existing code that calls `.encode()` works unchanged.
- Dimensionality: 384-dimensional output (97M) and 768-dimensional output (311M), matching the most common existing defaults. No index migration required.
- Model size: The 97M model’s weights are 195 MB (safetensors), less than half the size of `paraphrase-multilingual-MiniLM-L12-v2` (471 MB), the most common multilingual default. The quantized ONNX weights are just 98 MB, comparable to `all-MiniLM-L6-v2` (91 MB) while covering 200+ languages.
- CPU-friendly: Ships with ONNX and OpenVINO weights for optimized CPU inference. No GPU dependency for a getting-started tutorial.
- Multilingual by default: If your current default is English-only, this is a one-line swap that gives every user in your community support for 200+ languages without touching their code.
- Stable identifier: `ibm-granite/granite-embedding-97m-multilingual-r2` on Hugging Face, maintained by IBM under the Granite model family.
To discuss adopting these models as a default in your project, open an issue at ibm-granite/granite-embedding-models.
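The file sizes quoted in the list above are consistent with simple bytes-per-parameter arithmetic. The sketch below assumes 2 bytes per parameter for the safetensors weights (a 16-bit format) and about 1 byte per parameter for the quantized ONNX weights (8-bit). These precisions are inferred from the sizes, not stated in the post, and file metadata accounts for small differences.

```python
def weight_megabytes(num_params: int, bytes_per_param: float) -> float:
    """Approximate weight-file size in MB (1 MB = 1e6 bytes)."""
    return num_params * bytes_per_param / 1e6


params_97m = 97_000_000  # round figure from the model name

# Assumed precisions: 16-bit safetensors, 8-bit quantized ONNX.
print(f"16-bit: {weight_megabytes(params_97m, 2):.0f} MB (quoted: 195 MB)")
print(f" 8-bit: {weight_megabytes(params_97m, 1):.0f} MB (quoted: 98 MB)")
```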
Which Model Should You Use?
These two multilingual models are part of the broader Granite Embedding R2 family, which also includes two high-performing English-focused models: granite-embedding-english-r2 (149M parameters) and granite-embedding-small-english-r2 (47M parameters). If your data is predominantly English, the English models offer higher retrieval quality on English benchmarks at a smaller footprint, since they don’t need to allocate capacity across 200+ languages.
Try The Models
Both models are available now on Hugging Face under the IBM Granite Embedding collection.
You will also be able to try the small models interactively (on CPU) shortly via a Granite Embedding demo (coming soon) on Hugging Face Spaces, or run the full examples notebook in Google Colab.
Our detailed technical report, the Granite Multilingual Embedding R2 report, covers the full training methodology, per-language evaluations, and pruning ablations. For questions, feedback, or issues, visit ibm-granite/granite-embedding-models on GitHub.
**Framework maintainers:** If you’d like to adopt these models as a default in your project, open an issue at ibm-granite/granite-embedding-models. We’re happy to help with integration, testing, and any questions about licensing or deployment.
Give them a try, and if the embeddings spark joy, smash that ❤️ button on Hugging Face. Our models have feelings too, and every +1 keeps them warm at night.