Tag
This paper causally localizes a subgraph for temporal preference in a distilled LLM, finding that the model discounts the future less steeply than humans and that steering vectors can shift temporal preference, highlighting the need for explicit control mechanisms.
Introduces PRIG, a gradient attribution method that localizes prompt ambiguity in large language models by training a linear probe to distinguish clear from ambiguous prompts and attributing the probe score to token representations in the residual stream, achieving strong performance on synthetic and human-written benchmarks.
A comparative study evaluating three explainability techniques (Integrated Gradients, Attention Rollout, SHAP) on fine-tuned DistilBERT for sentiment classification, highlighting trade-offs between gradient-based, attention-based, and model-agnostic approaches for LLM interpretability.