Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

Hugging Face Daily Papers 05/16/26, 12:00 AM Papers

Summary

This paper proposes Evidence-Calibrated Query Clustering (ECC), an algorithm that aligns semantic embeddings with latent LLM capability demands using posterior model comparisons and Bradley-Terry modeling, significantly improving capability ranking quality for LLM evaluation.

Query clustering organizes queries into groups that reflect shared latent capability demands, enabling capability-aware LLM evaluation. Existing clustering methods, which primarily rely on semantic taxonomies or embeddings, often fail to capture such latent capability requirements due to a misalignment between surface-level semantics and actual model performance. We propose ECC, an algorithm that calibrates prior semantic embeddings using limited posterior model comparisons to bridge the gap between surface-level semantics and latent capability requirements. ECC characterizes each cluster through a capability profile parameterized by a Bradley-Terry model and uses trainable mixture weights to accommodate queries with mixed capability demands, jointly learning a flexible, capability-aware clustering structure that supports query-specific inference of LLM capabilities. Extensive quantitative and qualitative evaluations demonstrate that ECC significantly improves LLM capability ranking quality, outperforming human-labeled and embedding-based baselines by an average of 17.64 and 18.02 percentage points, respectively, and proves effective in downstream tasks such as query routing.

Original Article


Cached at: 05/22/26, 02:32 AM


Source: https://huggingface.co/papers/2605.17110

Abstract

Query clustering algorithm ECC improves LLM capability evaluation by aligning semantic embeddings with latent capability demands through posterior model comparisons and Bradley-Terry modeling.

Query clustering organizes queries into groups that reflect shared latent capability demands, enabling capability-aware LLM evaluation. Existing clustering methods, which primarily rely on semantic taxonomies or embeddings, often fail to capture such latent capability requirements due to a misalignment between surface-level semantics and actual model performance. We propose ECC, an algorithm that calibrates prior semantic embeddings using limited posterior model comparisons to bridge the gap between surface-level semantics and latent capability requirements. ECC characterizes each cluster through a capability profile parameterized by a Bradley-Terry model and uses trainable mixture weights to accommodate queries with mixed capability demands, jointly learning a flexible, capability-aware clustering structure that supports query-specific inference of LLM capabilities. Extensive quantitative and qualitative evaluations demonstrate that ECC significantly improves LLM capability ranking quality, outperforming human-labeled and embedding-based baselines by an average of 17.64 and 18.02 percentage points, respectively, and proves effective in downstream tasks such as query routing.
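
The abstract does not give ECC's equations, so the following is only an illustrative sketch of the general idea it describes: each cluster carries a Bradley-Terry strength vector over models, and a query blends those clusters through mixture weights. The function names, the softmax parameterization of the weights, the weighted-strength ranking rule, and all numeric values are assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(logits):
    """Turn unnormalized cluster logits into mixture weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bt_win_prob(strength_i, strength_j):
    """Bradley-Terry probability that model i beats model j."""
    return 1.0 / (1.0 + math.exp(-(strength_i - strength_j)))


def mixture_win_prob(weights, profiles, i, j):
    """Query-specific win probability: weighted average of per-cluster
    Bradley-Terry probabilities (assumed form, not from the paper)."""
    return sum(w * bt_win_prob(p[i], p[j]) for w, p in zip(weights, profiles))


def rank_models(weights, profiles):
    """Rank models for one query by mixture-weighted strength (assumed rule)."""
    n_models = len(profiles[0])
    scores = [sum(w * p[m] for w, p in zip(weights, profiles))
              for m in range(n_models)]
    return sorted(range(n_models), key=lambda m: scores[m], reverse=True)


# Hypothetical capability profiles: 2 clusters x 3 models.
profiles = [
    [1.0, 0.0, -1.0],   # cluster 0: model 0 is strongest
    [-1.0, 0.0, 1.0],   # cluster 1: model 2 is strongest
]

# Hypothetical query whose logits lean toward cluster 0.
weights = softmax([2.0, 0.0])

p_0_beats_2 = mixture_win_prob(weights, profiles, 0, 2)
ranking = rank_models(weights, profiles)
```

In a routing setting, the first entry of `ranking` would be the model a router sends this query to. In the sketch the weights and profiles are fixed by hand, whereas the abstract describes them as learned jointly from semantic embeddings and a limited set of model comparisons.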


