KaLM-Reranker-V1: Fast but Not Late Interaction for Compressed Document Reranking
Summary
KaLM-Reranker-V1 is a fast reranker that decouples query and passage computation using an encoder-decoder architecture with Matryoshka embedding pooling and cross-attention, achieving state-of-the-art reranking performance on BEIR and competitive results on multilingual benchmarks.
Cached at: 06/23/26, 05:41 AM
Source: https://huggingface.co/papers/2606.22807
Abstract
KaLM-Reranker-V1 is a fast reranker that decouples query and passage computation using an encoder-decoder architecture with Matryoshka embedding pooling and cross-attention for efficient relevance modeling.
As retrieval systems scale, high-quality reranking becomes increasingly important. However, most existing rerankers, whether encoder-based or decoder-based, jointly encode the query and passage, tightly coupling their computation and limiting deployment efficiency as well as flexibility. We present KaLM-Reranker-V1, a fast but not late-interaction (FBNL) reranker that decouples query and passage computation while retaining expressive relevance modeling. Built on an encoder-decoder architecture, KaLM-Reranker-V1 uses the encoder to pre-encode passages with Matryoshka embedding pooling, while the decoder models the system instruction, user instruction, and query intent; cross-attention then captures relevance between the query context and passage representations. This design makes KaLM-Reranker-V1 efficient through decoupled passage encoding, yet not late interaction, by preserving rich relevance modeling through cross-attention. We instantiate KaLM-Reranker-V1 in three sizes, Nano, Small, and Large, with 0.27B, 1B, and 4B activated parameters, respectively. Extensive experiments on BEIR, MIRACL, and LMEB demonstrate that KaLM-Reranker-V1 achieves strong reranking performance with superior efficiency. On BEIR, KaLM-Reranker-V1 achieves state-of-the-art performance, on par with strong industrial models such as the Qwen3-Reranker series; on MIRACL, despite not being extensively trained on multilingual data, KaLM-Reranker-V1 still shows excellent reranking performance. Moreover, on LMEB, reranking models demonstrate a clear advantage, with even the 0.27B Nano model remaining competitive with 7-12B embedding models.
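The decoupling described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the function names (`pool_passage`, `cross_attention_score`, `rerank`), the chunked mean pooling that stands in for Matryoshka embedding pooling, the single-head dot-product attention, and the hand-written 2-dimensional vectors are all assumptions made for illustration. The sketch only shows the shape of the workflow: passages are compressed once, offline, and each incoming query is then scored against the cached representations through cross-attention.

```python
import math


def pool_passage(token_vecs, num_slots):
    """Compress token vectors into at most num_slots mean-pooled vectors.

    Assumption: plain chunked mean pooling, used here only as a stand-in
    for the paper's Matryoshka embedding pooling.
    """
    n = len(token_vecs)
    num_slots = min(num_slots, n)
    dim = len(token_vecs[0])
    slots = []
    for s in range(num_slots):
        chunk = token_vecs[s * n // num_slots:(s + 1) * n // num_slots]
        slots.append([sum(v[i] for v in chunk) / len(chunk) for i in range(dim)])
    return slots


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention_score(query_vecs, slots):
    """Toy relevance score: each query vector attends over the passage slots."""
    dim = len(slots[0])
    total = 0.0
    for q in query_vecs:
        logits = [dot(q, s) / math.sqrt(dim) for s in slots]
        top = max(logits)  # subtract the max for a numerically stable softmax
        weights = [math.exp(l - top) for l in logits]
        norm = sum(weights)
        context = [sum(w / norm * s[i] for w, s in zip(weights, slots))
                   for i in range(dim)]
        total += dot(q, context)
    return total / len(query_vecs)


def rerank(query_vecs, cache):
    """Score every cached passage against the query, best first."""
    scored = [(pid, cross_attention_score(query_vecs, slots))
              for pid, slots in cache.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Offline step: encode and compress passages once (hypothetical toy vectors).
passages = {
    "p_match": [[1.0, 0.0], [0.9, 0.1], [1.0, 0.0], [0.8, 0.2]],
    "p_other": [[0.0, 1.0], [0.1, 0.9], [0.0, 1.0], [0.2, 0.8]],
}
cache = {pid: pool_passage(vecs, num_slots=2) for pid, vecs in passages.items()}

# Online step: only the query side is computed per request.
query = [[1.0, 0.0], [0.9, 0.1]]
ranking = rerank(query, cache)
```

In this sketch the passage cache is built once and reused for every query, which is the efficiency argument the abstract makes; the scoring step still mixes query and passage information through attention rather than a single vector comparison.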
Get this paper in your agent:
hf papers read 2606.22807
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 3
KaLM-Embedding/KaLM-Reranker-V1-Nano • Text Ranking • 0.8B • Updated about 3 hours ago • 1
KaLM-Embedding/KaLM-Reranker-V1-Small • Text Ranking • 2B • Updated about 3 hours ago • 2 • 1
KaLM-Embedding/KaLM-Reranker-V1-Large • Text Ranking • 8B • Updated about 3 hours ago • 1
Similar Articles
MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval
MemReranker is a reasoning-aware reranking model family (0.6B/4B) designed for agent memory retrieval, addressing limitations in semantic similarity by incorporating LLM knowledge distillation for better temporal and causal reasoning.
@lu__jasper: Some early results from playing around with search on a subsampled version of OBLIQ-bench. Mixedbread's reranker is a b…
Early results from testing search on a subsampled OBLIQ-bench show that Mixedbread's reranker achieves strong MRR, sometimes outperforming GPT 5.5 on certain metrics with faster speed, though the benchmark remains challenging.
@liquidai: Introducing LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M: two multilingual retrieval models built for ultra-fast and a…
Liquid AI introduces LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M, two multilingual retrieval models optimized for fast and accurate search across 11 languages, with latency as low as 1.5ms.
CompressKV: Semantic-Retrieval-Guided KV-Cache Compression for Resource-Efficient Long-Context LLM Inference
CompressKV proposes a semantic-retrieval-guided KV-cache compression method for GQA-based LLMs, identifying Semantic Retrieval Heads to retain critical tokens. It achieves over 97% full-cache performance using only 3% of the KV cache on LongBench tasks.
River-LLM: Large Language Model Seamless Exit Based on KV Share
River-LLM proposes a training-free early-exit framework for decoder-only LLMs that uses KV-sharing to eliminate KV-cache gaps, achieving 1.71–2.16× speedup without quality loss.