HAKARI-Bench: A Lightweight Benchmark for Comparing Retrieval Architectures and Efficiency Settings under Unified Conditions
Summary
HAKARI-Bench is a lightweight benchmark for comparing retrieval methods across multiple configurations and languages, enabling efficient model selection and performance analysis. Its overall model ranking reproduces that of full benchmarks such as MTEB with high rank correlation (Spearman > 0.97), while being faster to run.
Source: https://huggingface.co/papers/2606.22778
Abstract
With the rapid spread of retrieval-augmented generation and semantic search, choosing the right embedding and retrieval configuration is increasingly hard. Large retrieval benchmarks are comprehensive but too heavy to rerun during development, and there is little infrastructure for comparing production settings (dimensionality reduction, quantization, reranking) across many models under identical conditions. We present HAKARI-Bench, a lightweight benchmark that reconstructs existing retrieval suites into small datasets (Nano-sets): 35 benchmarks and 551 tasks across 43 languages in a unified format, enabling same-condition, model-agnostic comparison of five retrieval families (BM25, dense, sparse, late interaction, rerankers) and their efficiency variants. Across 55 models, its overall ranking reproduces the official MTEB retrieval v2, MMTEB v2 retrieval, and English BEIR (full) at Spearman > 0.97. HAKARI-Bench does not replace full evaluation; it enables rapid model selection, regression detection, and reading the quality-efficiency Pareto frontier. Code, data, and leaderboard are released under the MIT license.
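The abstract relies on two analysis steps: checking that a small benchmark ranks models like the full one (Spearman rank correlation), and reading a quality-efficiency Pareto frontier. The sketch below illustrates both in plain Python. It is not the HAKARI-Bench code: the function names, model names, scores, and costs are all hypothetical, chosen only to show the arithmetic.

```python
from typing import List, Tuple


def average_ranks(scores: List[float]) -> List[float]:
    """Rank scores in descending order (1 = best); ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman correlation: the Pearson correlation of the two rank vectors.

    Assumes both inputs have the same length and are not constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def pareto_frontier(models: List[Tuple[str, float, float]]) -> List[str]:
    """Return names of models not dominated on (higher quality, lower cost)."""
    frontier = []
    for name, quality, cost in models:
        dominated = any(
            q2 >= quality and c2 <= cost and (q2 > quality or c2 < cost)
            for _, q2, c2 in models
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical scores for five models on a small and a full benchmark.
nano_scores = [0.62, 0.58, 0.55, 0.49, 0.41]
full_scores = [0.60, 0.57, 0.50, 0.52, 0.40]
print(round(spearman(nano_scores, full_scores), 3))  # 0.9

# Hypothetical (name, quality, relative cost) triples.
candidates = [
    ("dense-large", 0.62, 40.0),
    ("dense-mid", 0.53, 12.0),
    ("dense-small", 0.55, 8.0),
    ("dense-small-int8", 0.54, 3.0),
    ("bm25", 0.41, 1.0),
]
print(pareto_frontier(candidates))
```

In this toy data only two adjacent models swap places between the small and full benchmark, giving a correlation of 0.9, and `dense-mid` drops off the frontier because `dense-small` is both better and cheaper. For real analyses, a library routine such as `scipy.stats.spearmanr` would be the more usual choice.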
Similar Articles
HakushoBench: A Japanese Chart and Table VQA Benchmark from Governmental White Papers
HakushoBench is a Japanese chart and table VQA benchmark built from governmental white papers to evaluate vision-language models' understanding of complex visual data, challenging open-weight models with a 58.6% accuracy and a 34.9-point gap to proprietary models.
UsefulBench: Towards Decision-Useful Information as a Target for Information Retrieval
UsefulBench introduces a domain-specific benchmark dataset that distinguishes between document relevance and usefulness for information retrieval, showing that similarity-based IR systems conflate these concepts while LLMs can address this but lack domain expertise.
@dianetc_: We set out to build a better retriever, so we looked for the hardest IR benchmarks. For each, we asked how much headroo…
The authors introduce OBLIQ-Bench, a new benchmark designed to evaluate information retrieval systems on significantly harder search queries where previous benchmarks showed little remaining headroom.
MTR-Suite: A Framework for Evaluating and Synthesizing Conversational Retrieval Benchmarks
Introduces MTR-Suite, a unified framework for evaluating and synthesizing conversational retrieval benchmarks, featuring an LLM-based auditor, a multi-agent pipeline for cost-effective dialogue generation, and a benchmark with high discriminative power.
Beyond Retrieval: A Multitask Benchmark and Model for Code Search
This paper introduces CoREB, a contamination-limited multitask benchmark for code search that evaluates text-to-code, code-to-text, and code-to-code retrieval with fine-tuned reranking capabilities.