Tag
This paper introduces the Attribution Contract, a specification for feature-attribution claims in generative language models that addresses ambiguities in what constitutes a feature and in how attribution methods should be evaluated. It uses autoregressive and diffusion models as case studies to show when attribution is informative and when it is misleading.
This paper proves that no feature ranking can be simultaneously faithful, stable, and complete under collinearity, characterizing the full attribution design space and providing a formally verified impossibility theorem in explainable AI.
A 16-year-old developer created sage-explainer, a Python package that approximates prediction sensitivity to features for black-box models like random forests and XGBoost, offering more stable results than centered finite differences.
Introduces a weight-perturbation-based feature attribution method (XWP and XWPc) for fully connected neural networks, achieving competitive performance on standard baseline metrics.
Researchers introduce PIE, a CLT-native framework for efficient circuit discovery via feature attribution-based pruning, achieving ~40× compression in feature selection while maintaining behavioral fidelity on IOI and Doc-String tasks.
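For readers unfamiliar with the baseline named in the sage-explainer entry above, centered finite differences estimate how sensitive a prediction is to each feature by nudging that feature up and down by a small step and dividing the change in output by twice the step. The following is a minimal sketch of that baseline only, not of sage-explainer itself. The names `centered_difference_sensitivity`, `toy_model`, and the step size `h` are hypothetical and chosen for illustration.

```python
from typing import Callable, List, Sequence


def centered_difference_sensitivity(
    predict: Callable[[Sequence[float]], float],
    x: Sequence[float],
    h: float = 1e-4,
) -> List[float]:
    """Approximate the partial derivative of predict with respect to each
    feature of x, using centered finite differences with step h."""
    sensitivities = []
    for j in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += h   # nudge feature j upward
        x_minus[j] -= h  # nudge feature j downward
        slope = (predict(x_plus) - predict(x_minus)) / (2.0 * h)
        sensitivities.append(slope)
    return sensitivities


def toy_model(x: Sequence[float]) -> float:
    # Hypothetical stand-in for a black-box model: smooth, with an interaction term.
    return 3.0 * x[0] - 2.0 * x[1] + 0.5 * x[0] * x[1]


sens = centered_difference_sensitivity(toy_model, [1.0, 2.0])
print(sens)
```

On a smooth function like `toy_model` the estimate closely tracks the true partial derivatives. Tree ensembles such as random forests produce piecewise-constant predictions, so the same estimate there depends heavily on the choice of `h`, which is one plausible reason a package might aim for something more stable.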