Tag
This paper proposes CLIF, a method that uses influence functions to interpret NLP models at both the sample and concept levels within Concept Bottleneck Models, enabling transparent debugging and concept-level analysis.
This paper introduces a framework for token-level influence attribution in large language models. It learns orthogonal latent spaces with sparse autoencoders to identify the training-data tokens that jointly influence a prediction, with applications in high-stakes domains such as healthcare.