Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

Hugging Face Daily Papers 04/20/26, 12:00 AM Papers

Summary

Researchers apply contrastive LRP-based attribution to analyze why LLMs fail on realistic benchmarks, finding the method gives useful signals in some cases but is not universally reliable.

Interpretability tools are increasingly used to analyze failures of Large Language Models (LLMs), yet prior work largely focuses on short prompts or toy settings, leaving their behavior on commonly used benchmarks underexplored. To address this gap, we study contrastive, LRP-based attribution as a practical tool for analyzing LLM failures in realistic settings. We formulate failure analysis as contrastive attribution, attributing the logit difference between an incorrect output token and a correct alternative to input tokens and internal model states, and introduce an efficient extension that enables construction of cross-layer attribution graphs for long-context inputs. Using this framework, we conduct a systematic empirical study across benchmarks, comparing attribution patterns across datasets, model sizes, and training checkpoints. Our results show that this token-level contrastive attribution can yield informative signals in some failure cases, but is not universally applicable, highlighting both its utility and its limitations for realistic LLM failure analysis. Our code is available at: https://aka.ms/Debug-XAI.

Original Article

View Cached Full Text

Cached at: 04/22/26, 06:17 AM

Paper page - Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

Source: https://huggingface.co/papers/2604.17761 Published on Apr 20

Submitted byhttps://huggingface.co/rongyuan

tanon Apr 22

Abstract

Contrastive attribution methods for analyzing large language model failures show mixed effectiveness across different benchmarks and model sizes.

Interpretability tools are increasingly used to analyze failures ofLarge Language Models(LLMs), yet prior work largely focuses on short prompts or toy settings, leaving their behavior on commonly used benchmarks underexplored. To address this gap, we study contrastive,LRP-based attributionas a practical tool for analyzing LLM failures in realistic settings. We formulatefailure analysisascontrastive attribution, attributing the logit difference between an incorrect output token and a correct alternative to input tokens and internal model states, and introduce an efficient extension that enables construction ofcross-layer attribution graphsfor long-context inputs. Using this framework, we conduct a systematic empirical study across benchmarks, comparing attribution patterns across datasets, model sizes, and training checkpoints. Our results show that this token-levelcontrastive attributioncan yield informative signals in some failure cases, but is not universally applicable, highlighting both its utility and its limitations for realistic LLMfailure analysis. Our code is available at: https://aka.ms/Debug-XAI.

View arXiv page View PDF Project page GitHub2 Add to collection

Get this paper in your agent:

hf papers read 2604\.17761

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2604.17761 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2604.17761 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2604.17761 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

Paper page - Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

Abstract

Models citing this paper0

Datasets citing this paper0

Spaces citing this paper0

Collections including this paper0

Similar Articles

LLM Attribution Analysis Across Different Fine-Tuning Strategies and Model Scales for Automated Code Compliance

Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

From Confident Closing to Silent Failure: Characterizing False Success in LLM Agents

Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment

Do LLM Attribution Metrics Transfer? Auditing Retrieval-Augmented Generation Evaluation Across Datasets and Constructs

Submit Feedback

Similar Articles

LLM Attribution Analysis Across Different Fine-Tuning Strategies and Model Scales for Automated Code Compliance

Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

From Confident Closing to Silent Failure: Characterizing False Success in LLM Agents

Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment

Do LLM Attribution Metrics Transfer? Auditing Retrieval-Augmented Generation Evaluation Across Datasets and Constructs