Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks
Summary
Researchers apply contrastive LRP-based attribution to analyze why LLMs fail on realistic benchmarks, finding the method gives useful signals in some cases but is not universally reliable.
Cached at: 04/22/26, 06:17 AM
Source: https://huggingface.co/papers/2604.17761 Published on Apr 20
Submitted by https://huggingface.co/rongyuan tan on Apr 22
Abstract
Contrastive attribution methods for analyzing large language model failures show mixed effectiveness across different benchmarks and model sizes.
Interpretability tools are increasingly used to analyze failures of Large Language Models (LLMs), yet prior work largely focuses on short prompts or toy settings, leaving their behavior on commonly used benchmarks underexplored. To address this gap, we study contrastive, LRP-based attribution as a practical tool for analyzing LLM failures in realistic settings. We formulate failure analysis as contrastive attribution, attributing the logit difference between an incorrect output token and a correct alternative to input tokens and internal model states, and introduce an efficient extension that enables construction of cross-layer attribution graphs for long-context inputs. Using this framework, we conduct a systematic empirical study across benchmarks, comparing attribution patterns across datasets, model sizes, and training checkpoints. Our results show that this token-level contrastive attribution can yield informative signals in some failure cases, but is not universally applicable, highlighting both its utility and its limitations for realistic LLM failure analysis. Our code is available at: https://aka.ms/Debug-XAI.
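To make the contrastive formulation in the abstract concrete, the following is a minimal sketch and not the paper's implementation. It assumes a toy, bias-free linear "model" whose logits are computed from summed token embeddings; in that special case the basic LRP rule reduces to input times gradient, so each token's relevance for the logit difference is its embedding dotted with the difference of the two output weight rows. All names, tokens, and numbers (`contrastive_attribution`, `embeddings`, `w_wrong`, `w_right`) are hypothetical.

```python
# Toy illustration only: a bias-free linear "model" over summed token embeddings.
# Every name and number here is hypothetical, not taken from the paper's code.

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def logit(token_embeddings, w):
    """Logit of one output token: weight row dotted with the pooled input."""
    pooled = [sum(col) for col in zip(*token_embeddings)]
    return dot(pooled, w)

def contrastive_attribution(token_embeddings, w_wrong, w_right):
    """Per-token contribution to logit(wrong) - logit(right)."""
    # Contrastive direction: difference of the two output-token weight rows.
    w_diff = [a - b for a, b in zip(w_wrong, w_right)]
    # For a bias-free linear map, relevance is input times gradient,
    # i.e. each token embedding dotted with the contrastive direction.
    return [dot(e, w_diff) for e in token_embeddings]

tokens = ["the", "capital", "of", "France"]
embeddings = [[0.5, 0.0], [1.0, 2.0], [0.0, 0.5], [3.0, -1.0]]
w_wrong = [1.0, 0.5]   # weight row of the incorrect output token
w_right = [0.5, 1.0]   # weight row of the correct alternative

scores = contrastive_attribution(embeddings, w_wrong, w_right)
logit_diff = logit(embeddings, w_wrong) - logit(embeddings, w_right)

for tok, s in zip(tokens, scores):
    print(f"{tok:>8}: {s:+.2f}")
print("sum of attributions:", sum(scores), "logit difference:", logit_diff)
```

In this toy setting the per-token scores sum exactly to the logit difference, and a positive score marks a token that pushes the model toward the incorrect output. Real transformer layers (attention, normalization, nonlinearities) require dedicated propagation rules, which this sketch does not attempt to cover.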
Similar Articles
LLM Attribution Analysis Across Different Fine-Tuning Strategies and Model Scales for Automated Code Compliance
This paper analyzes how different fine-tuning strategies (FFT, LoRA, quantized LoRA) and model scales affect LLM interpretive behavior for automated code compliance tasks using perturbation-based attribution analysis. The findings show FFT produces more focused attribution patterns than parameter-efficient methods, and larger models develop specific interpretive strategies with diminishing performance returns beyond 7B parameters.
Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks
This paper identifies a blind spot in long-context LLM reasoning benchmarks: they fail to control task position within the context, allowing positional failures to go undetected. The authors propose Context Rot Evaluation (CRE) to systematically vary task position, filler content, and context length, revealing severe accuracy drops for some models when reasoning tasks are placed in the middle of long contexts.
From Confident Closing to Silent Failure: Characterizing False Success in LLM Agents
This paper characterizes 'false success' in LLM agents, where agents claim task completion despite environment state showing otherwise, finding it accounts for 45-75% of failures across benchmarks. LLM judges fail to detect this reliably, while lightweight TF-IDF detectors achieve high AUROC with much lower latency, suggesting production monitoring should use calibrated detectors instead of LLM judges.
Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment
This paper investigates whether standard benchmarks underestimate LLM performance by re-evaluating hallucination detection datasets using an LLM-first, human-adjudicated assessment method. The study finds that incorporating LLM reasoning into the adjudication process improves agreement and suggests that model-assisted re-evaluation yields more reliable benchmarks for ambiguity-prone tasks.
Do LLM Attribution Metrics Transfer? Auditing Retrieval-Augmented Generation Evaluation Across Datasets and Constructs
This paper audits eight automatic attribution metrics across three evaluation constructs for RAG systems, finding that no single metric transfers across datasets within the same construct, challenging the common practice of treating them as interchangeable.