ECI_{sem}: Semantic Residual Effective Contrastive Information for Evaluating Hard Negatives

Hugging Face Daily Papers

Summary

ECI_{sem} is a training-free method for ranking hard-negative sources in dense retrieval using frozen embeddings, achieving strong performance on MS MARCO and BEIR benchmarks.

Hard-negative source selection for dense retrieval is usually decided only after fine-tuning and downstream evaluation. We propose ECI_{sem}, a semantic residual variant of Effective Contrastive Information (ECI) that ranks candidate negative sources using frozen target-encoder embeddings. ECI_{sem} is training-free, not label-free: each scored example requires a query, a labeled positive, and an explicit candidate negative. ECI_{sem} builds a weighted residual information matrix from target consistency, semantic locality, lexical residuality, and a log-determinant diversity objective. On MS MARCO negative sources, in-family ECI_{sem} ranks LLM negatives highest among non-hybrid sources and Dense+LLM highest among hybrid sources, matching the strongest aggregate BEIR transfer results across DistilBERT, E5-base, and Contriever. Controlled ablations show that this alignment depends on using the target encoder family, while additional ablations show stability under sample-size, temperature, tokenizer, and IDF-corpus perturbations. The theory gives a local linearized link to loss reduction, while the empirical study treats downstream evaluation as the final test.

Cached at: 06/08/26, 11:18 PM

Source: https://huggingface.co/papers/2603.20990 Published on Jun 5

Submitted by Aarush (https://huggingface.co/chungimungi) on Jun 8

Abstract

ECI_{sem}, a semantic residual variant of Effective Contrastive Information, ranks negative sources for dense retrieval using frozen embeddings without requiring training, achieving strong performance on MS MARCO and BEIR benchmarks.

Hard-negative source selection for dense retrieval is usually decided only after fine-tuning and downstream evaluation. We propose ECI_{sem}, a semantic residual variant of Effective Contrastive Information (ECI) that ranks candidate negative sources using frozen target-encoder embeddings. ECI_{sem} is training-free, not label-free: each scored example requires a query, a labeled positive, and an explicit candidate negative. ECI_{sem} builds a weighted residual information matrix from target consistency, semantic locality, lexical residuality, and a log-determinant diversity objective. On MS MARCO negative sources, in-family ECI_{sem} ranks LLM negatives highest among non-hybrid sources and Dense+LLM highest among hybrid sources, matching the strongest aggregate BEIR transfer results across DistilBERT, E5-base, and Contriever. Controlled ablations show that this alignment depends on using the target encoder family, while additional ablations show stability under sample-size, temperature, tokenizer, and IDF-corpus perturbations. The theory gives a local linearized link to loss reduction, while the empirical study treats downstream evaluation as the final test.
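
To make the ingredients named in the abstract concrete, the following is a minimal sketch of how a log-determinant score over weighted residuals could be computed from frozen embeddings. The abstract does not give formulas, so everything here is an assumption for illustration: the function name `eci_sem_sketch`, the sigmoid form of target consistency, the exponential form of semantic locality, the `1 - overlap` form of lexical residuality, the residual taken orthogonal to the positive, and the `tau` and `ridge` parameters. It should not be read as the paper's definition.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Row-wise L2 normalization so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def eci_sem_sketch(q, p, n, lex_overlap, tau=0.5, ridge=1.0):
    """Hypothetical log-det score for one negative source (illustration only).

    q, p, n     : (N, d) frozen embeddings of queries, labeled positives,
                  and candidate negatives from the target encoder.
    lex_overlap : (N,) lexical overlap in [0, 1] for each negative
                  (how it is computed, e.g. IDF weighting, is left open).
    """
    q, p, n = l2_normalize(q), l2_normalize(p), l2_normalize(n)
    s_pos = np.sum(q * p, axis=1)
    s_neg = np.sum(q * n, axis=1)

    # Assumed target consistency: down-weight negatives the encoder
    # scores above the labeled positive (possible false negatives).
    consistency = 1.0 / (1.0 + np.exp(-(s_pos - s_neg) / tau))
    # Assumed semantic locality: up-weight negatives close to the query.
    locality = np.exp((s_neg - 1.0) / tau)
    # Assumed lexical residuality: the share not explained by lexical overlap.
    residuality = 1.0 - np.asarray(lex_overlap, dtype=float)
    w = consistency * locality * residuality

    # Assumed residual: component of the negative orthogonal to its positive.
    r = n - np.sum(n * p, axis=1, keepdims=True) * p

    # Weighted residual information matrix and log-det diversity score.
    m = (r * w[:, None]).T @ r / len(w)
    _, logdet = np.linalg.slogdet(np.eye(m.shape[0]) + m / ridge)
    return float(logdet)

# Usage with random stand-in embeddings (not real encoder outputs).
rng = np.random.default_rng(0)
q = rng.normal(size=(32, 8))
p = rng.normal(size=(32, 8))
n = rng.normal(size=(32, 8))
score = eci_sem_sketch(q, p, n, lex_overlap=np.zeros(32))
```

Under these assumptions, candidate sources would be ranked by scoring each one on the same queries and positives and sorting by the returned value. A source whose negatives carry no residual weight scores zero, since the matrix inside the determinant reduces to the identity.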

Similar Articles

Sem-Detect: Semantic Level Detection of AI Generated Peer-Reviews

arXiv cs.CL

Sem-Detect introduces a method to distinguish AI-generated peer reviews from human-written ones by combining textual features with claim-level semantic analysis. It achieves a 25.5% improvement in true positive rate at 0.1% false positive rate over baselines, and shows that LLM-refined human reviews retain distinct semantic signals, with fewer than 3.5% misclassified as AI-generated.

When Softmax Fails at the Top: Extreme Value Corrections for InfoNCE

arXiv cs.LG

The paper identifies a misalignment between the softmax-based InfoNCE loss and the normalized embedding setting in modern contrastive learning. It proposes WEINCE, a simple modification that blends softmax logits with an endpoint shortfall correction using extreme value theory, yielding consistent improvements across vision benchmarks.