Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

arXiv cs.AI Papers

Summary

This paper introduces ECC, an algorithm that calibrates semantic embeddings with a limited number of model comparisons so that queries are clustered by their latent capability requirements. It improves LLM capability ranking quality by an average of 17.64 and 18.02 percentage points over human-labeled and embedding-based baselines, respectively.

arXiv:2605.17110v1

Abstract: Query clustering organizes queries into groups that reflect shared latent capability demands, enabling capability-aware LLM evaluation. Existing clustering methods, which primarily rely on semantic taxonomies or embeddings, often fail to capture such latent capability requirements due to a misalignment between surface-level semantics and actual model performance. We propose ECC, an algorithm that calibrates prior semantic embeddings using limited posterior model comparisons to bridge the gap between surface-level semantics and latent capability requirements. ECC characterizes each cluster through a capability profile parameterized by a Bradley-Terry model and uses trainable mixture weights to accommodate queries with mixed capability demands, jointly learning a flexible, capability-aware clustering structure that supports query-specific inference of LLM capabilities. Extensive quantitative and qualitative evaluations demonstrate that ECC significantly improves LLM capability ranking quality, outperforming human-labeled and embedding-based baselines by an average of 17.64 and 18.02 percentage points, respectively, and proves effective in downstream tasks such as query routing.


# Capturing LLM Capabilities via Evidence-Calibrated Query Clustering
Source: [https://arxiv.org/abs/2605.17110](https://arxiv.org/abs/2605.17110)
[View PDF](https://arxiv.org/pdf/2605.17110)

> Abstract: Query clustering organizes queries into groups that reflect shared latent capability demands, enabling capability-aware LLM evaluation. Existing clustering methods, which primarily rely on semantic taxonomies or embeddings, often fail to capture such latent capability requirements due to a misalignment between surface-level semantics and actual model performance. We propose ECC, an algorithm that calibrates prior semantic embeddings using limited posterior model comparisons to bridge the gap between surface-level semantics and latent capability requirements. ECC characterizes each cluster through a capability profile parameterized by a Bradley-Terry model and uses trainable mixture weights to accommodate queries with mixed capability demands, jointly learning a flexible, capability-aware clustering structure that supports query-specific inference of LLM capabilities. Extensive quantitative and qualitative evaluations demonstrate that ECC significantly improves LLM capability ranking quality, outperforming human-labeled and embedding-based baselines by an average of 17.64 and 18.02 percentage points, respectively, and proves effective in downstream tasks such as query routing.
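
The abstract gives no equations, so the following is only a minimal sketch of one plausible reading of "a capability profile parameterized by a Bradley-Terry model" combined with "trainable mixture weights". It assumes each cluster holds one Bradley-Terry score per model, each query holds a softmax-normalized weight per cluster, and the query-level win probability is the weighted average of the per-cluster probabilities. The function names, the `theta` and `mix_logits` arrays, and the ranking rule are all hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def bradley_terry_win_prob(theta: np.ndarray, i: int, j: int) -> np.ndarray:
    """P(model i beats model j) under each cluster's Bradley-Terry profile.

    theta has shape (num_clusters, num_models); the result has shape
    (num_clusters,), one probability per cluster.
    """
    return 1.0 / (1.0 + np.exp(-(theta[:, i] - theta[:, j])))


def query_win_prob(mix_logits: np.ndarray, theta: np.ndarray, i: int, j: int) -> float:
    """Assumed mixture rule: average the per-cluster win probabilities,
    weighted by the query's softmax-normalized cluster weights."""
    weights = softmax(mix_logits)
    return float(weights @ bradley_terry_win_prob(theta, i, j))


def rank_models_for_query(mix_logits: np.ndarray, theta: np.ndarray) -> list[int]:
    """Rank models (best first) by mean win probability against all others.

    This ranking rule is an illustrative assumption, not the paper's.
    """
    num_models = theta.shape[1]
    scores = [
        np.mean([query_win_prob(mix_logits, theta, i, j)
                 for j in range(num_models) if j != i])
        for i in range(num_models)
    ]
    return [int(m) for m in np.argsort(scores)[::-1]]


# Hypothetical toy profiles: 2 clusters x 3 models.
theta = np.array([[2.0, 0.0, -1.0],    # cluster 0 favors model 0
                  [-1.0, 0.5, 1.5]])   # cluster 1 favors model 2

# A query that loads almost entirely on cluster 0, and one on cluster 1.
ranking_cluster0 = rank_models_for_query(np.array([10.0, -10.0]), theta)
ranking_cluster1 = rank_models_for_query(np.array([-10.0, 10.0]), theta)
```

In this sketch, a query with mixed capability demands would carry comparable weights on several clusters, so its inferred ranking blends the cluster profiles instead of committing to a single one. In the paper the weights and profiles are learned jointly from embeddings and model comparisons; that training step is not shown here.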

## Submission history

From: Fangzhou Wu [[view email](https://arxiv.org/show-email/ec52c671/2605.17110)]

**[v1]** Sat, 16 May 2026 18:30:37 UTC (1,188 KB)
