@topk_io: https://x.com/topk_io/status/2065172828161200563
Summary
TopK introduces semantic_index, a single schema annotation that abstracts multi-vector retrieval complexity for production systems, achieving state-of-the-art performance with sub-second latency and high throughput.
Cached at: 06/12/26, 08:59 AM
semantic_index: multi-vector retrieval, out of the box
State-of-the-art retrieval quality usually comes with a catch. You start with an embedding pipeline and a vector database, then add hybrid search with RRF and maybe a reranker. But production search systems don’t live isolated inside notebooks. They are distributed systems with reliability and latency budgets, constant writes, strict freshness requirements, high-QPS reads, filters, permissions, and much more.
That’s why we built **semantic_index**: a single schema annotation that abstracts this complexity and enables production-ready, state-of-the-art retrieval. Batteries included.
```python
from topk_sdk.schema import text, semantic_index

# `client` is assumed to be an already-configured TopK client.
client.collections().create(
    "docs",
    schema={"text": text().index(semantic_index())},
)
```
That’s the entire setup. No embedding pipeline, no separate vector store, no reranking service. Under the hood, semantic_index is powered by Iso-ModernColBERT (our multi-vector embedding model) and sparse multi-vector encoding (SMVE), which makes late interaction scale to billions of documents with filtering and online index updates.
Why multi-vector?
Single-vector (dense) embeddings compress an entire document into one point in high-dimensional space. That works until your queries get specific: a clause in a contract, a row in a financial table, or a step in a procedure. Late interaction models keep one embedding per token and match at token granularity, which is why they consistently outperform dense models on out-of-domain and long-context retrieval.
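To make token-level matching concrete, here is a minimal sketch of MaxSim, the late-interaction scoring function referred to throughout this post. The 2-dimensional vectors are toy values chosen for illustration, not output from Iso-ModernColBERT, and the helper names are our own.

```python
from math import sqrt

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_tokens, doc_tokens):
    """For each query token, take its best-matching document token,
    then sum those maxima over all query tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Toy token embeddings (illustrative values only).
query = [normalize(v) for v in [[1.0, 0.0], [0.0, 1.0]]]
# doc_a keeps a separate vector per token, so each query token finds an exact match.
doc_a = [normalize(v) for v in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]
# doc_b has a single "averaged" vector, similar to what a dense model would store.
doc_b = [normalize(v) for v in [[1.0, 1.0]]]

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
print(score_a, score_b)
```

The document that preserves per-token vectors scores higher because every query token finds its own close match, while the single averaged vector can only partially match each one.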
The problem was never quality; it was cost and operational complexity. Multi-vector indexes are an order of magnitude larger, and exact MaxSim scoring is computationally expensive. TopK solves this by identifying a small set of candidates using fast sparse approximations and then refining them to the final **top-k** results using quantized MaxSim reranking.
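The two-stage shape described above can be sketched roughly as follows. This is not TopK's SMVE implementation: the sparse code used here (the index of each token's largest component) and the plain int8-style scalar quantization are hypothetical stand-ins, chosen only to show a cheap candidate pass followed by a quantized MaxSim rerank.

```python
from collections import defaultdict

def quantize(v, scale=127):
    """Scalar-quantize floats in [-1, 1] to small integers (int8 range)."""
    return [round(x * scale) for x in v]

def maxsim_int(query_tokens, doc_tokens):
    """MaxSim computed on integer (quantized) vectors."""
    return sum(
        max(sum(a * b for a, b in zip(q, d)) for d in doc_tokens)
        for q in query_tokens
    )

def top_dim(v):
    """Hypothetical sparse code: index of the largest-magnitude component."""
    return max(range(len(v)), key=lambda i: abs(v[i]))

def build_index(docs):
    """Inverted index: sparse code -> ids of documents with a token using it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for t in tokens:
            index[top_dim(t)].add(doc_id)
    return index

def search(query_tokens, docs, index, k):
    # Stage 1: cheap candidate generation through the inverted index.
    candidates = set()
    for q in query_tokens:
        candidates |= index.get(top_dim(q), set())
    # Stage 2: rerank only the candidates with quantized MaxSim.
    q_int = [quantize(q) for q in query_tokens]
    scored = [
        (maxsim_int(q_int, [quantize(t) for t in docs[doc_id]]), doc_id)
        for doc_id in candidates
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]

# Toy corpus with per-token vectors (illustrative values only).
docs = {
    "a": [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]],
    "b": [[0.8, 0.2, 0.0]],
    "c": [[0.0, 0.1, 0.9]],
}
index = build_index(docs)
results = search([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], docs, index, k=2)
print(results)
```

Document "c" never reaches the rerank stage because it shares no sparse code with the query, which is the point of the first pass: the expensive scoring only runs on a small candidate set.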
Performance
- **Ingest 1.5B+ tokens/hour**: embedded and indexed, with sub-second index lag. Your documents are searchable as you write them.
- **295 QPS across BEIR with ~75ms p99 latency**: high-quality search without performance tradeoffs.
- **52.88% nDCG@10 on BEIR**: end-to-end, on the live system, not an offline eval of the model alone.
- **~30% higher recall and nDCG@10 on ViDoRe v3**: state-of-the-art performance, beating the 80x larger Qwen3-VL-Embedding-8B model.
- **80.48% accuracy on BrowseComp-Plus**: a top-5 research agent without a complicated search pipeline.
BEIR: Baseline
Most retrieval benchmarks measure the model in isolation. We measured the whole system end-to-end - including document ingestion, embedding inference, indexing, and concurrent queries on our production clusters. Across all 15 BEIR datasets, semantic_index averages 52.88% nDCG@10, only ~1% lower than the base model with exact MaxSim.
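For readers less familiar with the metric, here is a minimal sketch of how nDCG@10 is computed for a single query. It assumes the linear-gain form (gain equals the relevance label); some implementations use an exponential gain instead, so treat this as illustrative rather than as the exact evaluation code behind the numbers above. The relevance labels are made up.

```python
from math import log2

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (ranks start at 1)."""
    return sum(
        rel / log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """DCG of the returned ranking divided by the DCG of an ideal ranking."""
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical query with two relevant documents in the corpus.
judged = [1, 1]
# The system returned them at ranks 2 and 3 instead of ranks 1 and 2.
returned = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]

score = ndcg_at_k(returned, judged, k=10)
perfect = ndcg_at_k([1, 1, 0, 0, 0, 0, 0, 0, 0, 0], judged, k=10)
print(round(score, 4), perfect)
```

A benchmark figure such as the BEIR average is this per-query score averaged over all queries in a dataset, and then over datasets.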
On the performance side, every dataset cleared 175 QPS, smaller corpora pushed past 390 QPS, and p99 latency stayed between 50ms and 125ms with sub-second index lag throughout. Your documents are searchable as you write them, which is increasingly important for agentic use cases.
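As a reminder of what these two numbers mean, here is a small sketch that derives QPS and p99 latency from a list of per-query timings. The latencies are synthetic, the wall-clock duration is invented, and the nearest-rank percentile is one common definition, not necessarily the one used for the measurements above.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100:
    the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p * n / 100
    return ordered[max(rank, 1) - 1]

def qps(num_queries, wall_seconds):
    """Throughput: completed queries divided by elapsed wall-clock time."""
    return num_queries / wall_seconds

# Synthetic latencies in milliseconds: values 40..89, each repeated 20 times.
latencies_ms = [40 + (i % 50) for i in range(1000)]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
throughput = qps(len(latencies_ms), wall_seconds=4.0)
print(p50, p99, throughput)
```

Because queries run concurrently, throughput is measured against wall-clock time rather than the sum of individual latencies, which is why high QPS and a tail latency in the tens of milliseconds can coexist.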
ViDoRe V3: Enterprise Retrieval
BEIR is text-only. Enterprise retrieval means complex PDFs, tables, slides, and multilingual documents, which is exactly where token-level matching pulls away from dense embeddings. We compared semantic_index against Qwen3-VL-Embedding-8B, a state-of-the-art dense embedding model 80x larger than ours.
**+34% recall and +30% nDCG@10** improvement on average. On industrial documentation, recall jumps from 42.05% to 75.97% (+81% relative). Finance (EN) goes from 58.90% to 80.91%. Pharma from 60.16% to 83.52%. There isn’t a single domain where the 8B dense model wins.
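The percentages above are relative improvements over the baseline, not percentage-point differences. A short worked calculation using only the three domain figures quoted in this section (the function names are ours):

```python
def relative_gain(baseline, new):
    """Improvement expressed as a percentage of the baseline value."""
    return (new - baseline) / baseline * 100

def point_gain(baseline, new):
    """Improvement in absolute percentage points."""
    return new - baseline

# (baseline with the 8B dense model, semantic_index), as quoted above.
domains = {
    "Industrial": (42.05, 75.97),
    "Finance (EN)": (58.90, 80.91),
    "Pharma": (60.16, 83.52),
}

gains = {
    name: (point_gain(base, new), relative_gain(base, new))
    for name, (base, new) in domains.items()
}
for name, (points, relative) in gains.items():
    print(f"{name}: +{points:.2f} points, +{relative:.1f}% relative")
```

For industrial documentation this gives roughly 34 points and 81% relative, matching the figure quoted in the text.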
BrowseComp-Plus: Agentic Search
Retrieval is increasingly consumed by agents, not humans. We plugged semantic_index into an agentic research loop as the retriever for BrowseComp-Plus. A gpt-5 agent with the default harness achieved 80.48% accuracy and **77.82% recall**, which currently places it top-5 on the overall leaderboard (as of 11 June 2026).
The agent averaged ~14 search calls per task with 88.54% citation precision, which indicates that the retriever is surfacing the right documents early enough for the agent to ground its answers without wasting tokens.
Try it today
Everything above - multi-vector embedding inference, retrieval with quantized MaxSim reranking, online index updates, and filtering - is available today behind simple, high-level abstractions.
```python
from topk_sdk.schema import text, semantic_index
from topk_sdk.query import select, field, fn

# `client` is assumed to be an already-configured TopK client.

# Create collection with semantic_index
client.collections().create(
    "docs",
    schema={
        "text": text().index(semantic_index()),
    },
)

# Insert documents
client.collection("docs").upsert([
    {"_id": "doc-1", "text": "…"},
    {"_id": "doc-2", "text": "…"},
])

# Query using maxsim scoring with semantic_similarity
docs = client.collection("docs").query(
    select(
        "text",
        score=fn.semantic_similarity("text", "query string"),
    ).top_k(field("score"), 10)
)
```
🚀 Get started: console.topk.io
📚 Docs: docs.topk.io/guides/semantic-search