gradient-attribution

#gradient-attribution

Temporal Preference Concepts and their Functions in a Large Language Model

arXiv cs.LG · 2026-06-05

This paper causally localizes a subgraph for temporal preference in a distilled LLM, finding that the model discounts the future less steeply than humans and that steering vectors can shift temporal preference, highlighting the need for explicit control mechanisms.


#gradient-attribution

Localizing Prompt Ambiguity in Large Language Models with Probe-Targeted Attribution

arXiv cs.CL · 2026-06-05

Introduces PRIG, a gradient attribution method that localizes prompt ambiguity in large language models. It trains a linear probe to distinguish clear from ambiguous prompts, then attributes the probe score to token representations in the residual stream, achieving strong performance on synthetic and human-written benchmarks.


#gradient-attribution

Applied Explainability for Large Language Models: A Comparative Study

arXiv cs.CL · 2026-04-20

A comparative study of three explainability techniques (Integrated Gradients, Attention Rollout, SHAP) applied to a fine-tuned DistilBERT sentiment classifier, highlighting trade-offs between gradient-based, attention-based, and model-agnostic approaches to LLM interpretability.

