CLIF: Concept-Level Influence Functions for Transparent Bottleneck Models
Summary
This paper proposes CLIF, a method using influence functions to interpret NLP models at both sample and concept levels within Concept Bottleneck Models, enabling transparent debugging and concept-level analysis.
# CLIF: Concept-Level Influence Functions for Transparent Bottleneck Models
Source: [https://arxiv.org/html/2605.19848](https://arxiv.org/html/2605.19848)
Mingkun Xu, Mu You, Zhongzhi He, Henghua Shen, Zehan Tan, Derek F. Wong, Tao Fang (corresponding author: taofang@mmc.edu.mo)

1. Tandon School of Engineering, New York University, USA. Email: ys6123@nyu.edu
2. Guangdong Institute of Intelligence Science and Technology, Hengqin, China. Email: xunmingkun@gdiist.cn
3. Macau Millennium College, Macau SAR, China. Email: {youmuafonso, zhongzhihe, henghua.shen, zhtan, taofang}@mmc.edu.mo
4. NLP2CT Lab, Department of Computer and Information Science, University of Macau, Macau SAR, China. Email: derekfw@um.edu.mo
###### Abstract
In recent years, the black\-box nature of deep learning models has limited their application in high\-stakes domains such as medical diagnosis and finance, where interpretability is essential\. To address this, we propose a novel approach using influence functions to enhance interpretability in NLP models at both the sample and concept levels\. Experiments on CEBaB and Yelp datasets show that influence functions effectively identify the most impactful training samples—both helpful and harmful—on model predictions\. By adjusting the labels and weights of these samples, we demonstrate that model performance can be restored to baseline levels without retraining, confirming the value of influence functions for efficient data debugging\. Furthermore, our concept\-level analysis identifies key concepts within Concept Bottleneck Models \(CBM\) that significantly affect predictions\. Modifying these concepts alters model behavior observably, providing clear insights into the decision process\.
## 1 Introduction
Deep learning has revolutionized fields like image recognition, speech processing, and natural language processing over the past decade. Yet, in high-stakes applications such as medical diagnostics, where a model’s erroneous interpretation of patient symptoms could lead to misdiagnosis, the “black-box” nature of these models raises serious concerns about transparency and accountability \[[24](https://arxiv.org/html/2605.19848#bib.bib36)\]. This opacity hinders trust and adoption in domains demanding explainable decisions, including autonomous driving and financial forecasting.
To address the interpretability issue, the Concept Bottleneck Model (CBM) has emerged as a pioneering approach that incorporates a human-interpretable concept layer to bridge performance and explainability \[[15](https://arxiv.org/html/2605.19848#bib.bib37)\]. However, classical CBM suffers from performance degradation and heavy reliance on extensive concept-level annotations, which has spurred numerous computer vision solutions such as interactive \[[4](https://arxiv.org/html/2605.19848#bib.bib16)\], label-free \[[19](https://arxiv.org/html/2605.19848#bib.bib17)\], and post-hoc CBMs \[[36](https://arxiv.org/html/2605.19848#bib.bib18)\], alongside other theoretical and applied advancements \[[13](https://arxiv.org/html/2605.19848#bib.bib19)\]. Despite these extensive developments in computer vision \[[12](https://arxiv.org/html/2605.19848#bib.bib38),[5](https://arxiv.org/html/2605.19848#bib.bib39)\], CBM remains underexplored in NLP, with only recent preliminary studies \[[27](https://arxiv.org/html/2605.19848#bib.bib28)\] marking its nascent stages. Crucial challenges around performance and annotation efficiency therefore remain largely unaddressed in textual domains, which opens a significant avenue for research.
Moreover, the practical deployment of CBM is hampered by two critical interpretability gaps. First, at the sample level, it remains difficult to audit the model’s behavior by quantifying the specific, often counter-intuitive, impact of individual training examples, a crucial capability for debugging datasets and ensuring fairness. While tools like Influence Functions (IF) \[[14](https://arxiv.org/html/2605.19848#bib.bib41)\] offer a principled way to estimate such effects in standard models, their application within the structured CBM framework, particularly in NLP, remains underexplored. Second, at the concept level, the very features designed to be interpretable lack rigorous quantification. We cannot confidently state how much a change in a specific concept (e.g., “positive service”) influences the final output, which undermines the model’s accountability in high-stakes scenarios \[[34](https://arxiv.org/html/2605.19848#bib.bib42)\]. This mirrors a broader challenge in explainable AI: moving beyond highlighting important features (as done by methods like LIME \[[23](https://arxiv.org/html/2605.19848#bib.bib53)\] or SHAP \[[16](https://arxiv.org/html/2605.19848#bib.bib54)\]) towards precisely measuring their causal impact on model decisions.
To bridge these gaps simultaneously, we propose a novel hybrid framework that integrates Influence Functions \(IF\)—a powerful tool from robust statistics that has been advanced for efficiency and stability\[[25](https://arxiv.org/html/2605.19848#bib.bib47),[3](https://arxiv.org/html/2605.19848#bib.bib48)\]—into the CBM architecture for NLP\. Unlike post\-hoc explanation methods that approximate model behavior, IF provides a principled way to estimate the actual effect of any training sample on a model’s predictions and parameters by leveraging the model’s gradients\[[2](https://arxiv.org/html/2605.19848#bib.bib43)\]\. We hypothesize that by applying IF not only to samples but also to the concept bottleneck layer, we can achieve unprecedented interpretability: \(1\) Sample\-wise influence: Pinpointing which training examples are most responsible for a given prediction, directly addressing the data auditing challenge; and \(2\) Concept\-wise influence: Measuring the sensitivity of the model’s output to perturbations in each human\-understandable concept, thereby providing the rigorous quantification that current concept\-based approaches lack\. This dual application addresses the core limitations of standard CBM head\-on by embedding explainability directly into the model’s mechanics, rather than applying it as a separate, post\-hoc analysis\.
We validate our proposed framework using five mainstream pre-trained language models (GPT-2, BERT, RoBERTa, Qwen2.5-3B-Instruct, and Llama3.2-3B) as backbones for the CBM-NLP framework. Extensive experiments on the CEBaB and Yelp datasets demonstrate the framework’s effectiveness through three core analyses. The initial analysis reveals that sample-wise influences accurately identify training examples that significantly enhance or degrade model performance, enabling targeted dataset refinement. The sample-level counterfactual analysis confirms that the causal impacts estimated by IF align closely with observed changes in model behavior when influential sample labels are altered, while also showing shifts in influence rankings that highlight dynamic dataset interactions. The concept-level analysis, by injecting anomalies into key concept bottlenecks, quantifies the contribution of individual concepts to predictions, uncovering previously opaque decision-making patterns. These results collectively demonstrate that our framework enhances the granularity and causality of CBM’s interpretability, advancing the development of safe and trustworthy AI for real-world applications.
## 2 Related Work
### 2.1 Concept Bottleneck Model
Concept Bottleneck Model (CBM) has emerged as a pioneering deep learning technique for tasks such as image classification and visual reasoning. However, this approach encounters two noteworthy challenges: lower performance compared to models without a concept bottleneck layer, and heavy reliance on extensive dataset annotation. To address these issues, researchers have proposed targeted solutions. For example, \[[4](https://arxiv.org/html/2605.19848#bib.bib16)\] expanded CBM to interactive prediction settings by introducing an interaction policy to select concepts for labeling, thereby improving final predictions. \[[19](https://arxiv.org/html/2605.19848#bib.bib17)\] proposed Label-free CBM to mitigate annotation dependencies, while \[[36](https://arxiv.org/html/2605.19848#bib.bib18)\] developed Post-hoc Concept Bottleneck models that integrate with various neural networks without compromising performance. Despite extensive research in image processing \[[10](https://arxiv.org/html/2605.19848#bib.bib26),[13](https://arxiv.org/html/2605.19848#bib.bib19)\], concept-based models for NLP remained scarce until recent work \[[28](https://arxiv.org/html/2605.19848#bib.bib27),[27](https://arxiv.org/html/2605.19848#bib.bib28)\] introduced CBM datasets tailored for NLP tasks.
### 2.2 Influence Function
Influence functions (IFs) are critical for quantifying the impact of individual training samples on model predictions in deep learning \[[14](https://arxiv.org/html/2605.19848#bib.bib41)\]. By measuring the effect of data point perturbations on model parameters, IFs shed light on the decision-making of complex neural networks. However, their application to large-scale models is limited by computational costs, prompting the development of efficient approximations. Generalized influence functions (GIFs), proposed by Koh and Liang, optimize the inverse-Hessian-vector product to enable IF usage in image classification and sequence learning \[[14](https://arxiv.org/html/2605.19848#bib.bib41),[25](https://arxiv.org/html/2605.19848#bib.bib47)\]. Despite these advancements, IFs face challenges in non-convex models (e.g., numerical instability), which recent work has addressed by exploring more robust methodologies \[[3](https://arxiv.org/html/2605.19848#bib.bib48)\]. In large language models (LLMs), IFs have been used to analyze training data influence across layers, revealing sample impacts at different abstraction levels \[[20](https://arxiv.org/html/2605.19848#bib.bib49)\]. However, precision in influence estimation remains a bottleneck in non-convex settings. Ongoing research balances computational efficiency and accuracy via gradient-based methods \[[21](https://arxiv.org/html/2605.19848#bib.bib52)\] and relative influence functions \[[1](https://arxiv.org/html/2605.19848#bib.bib51)\], which are critical for enhancing IF reliability in practical applications.
### 2.3 Explanation Methods for NLP
Understanding NLP model decisions is vital for sensitive domains, with Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP) being widely used \[[23](https://arxiv.org/html/2605.19848#bib.bib53),[16](https://arxiv.org/html/2605.19848#bib.bib54)\]. LIME approximates a model locally with an interpretable surrogate to highlight key features, while SHAP uses cooperative game theory (Shapley values) to assign additive per-prediction feature attributions, which can be aggregated into global importance scores. Both are instrumental in sentiment analysis and text classification \[[18](https://arxiv.org/html/2605.19848#bib.bib55)\]. Attention mechanisms in Transformer models (e.g., BERT, GPT) provide another interpretability avenue by visualizing text focus areas, though their interpretability remains debated \[[6](https://arxiv.org/html/2605.19848#bib.bib57),[11](https://arxiv.org/html/2605.19848#bib.bib58)\]. Counterfactual explanations (input modification to observe prediction changes) and Integrated Gradients (path integral of gradients for feature attribution) have also emerged as powerful tools \[[31](https://arxiv.org/html/2605.19848#bib.bib45),[26](https://arxiv.org/html/2605.19848#bib.bib59)\]. Despite progress, NLP interpretability faces challenges such as the lack of standardized evaluation metrics and of context-aware explanations for linguistic dependencies \[[9](https://arxiv.org/html/2605.19848#bib.bib63),[33](https://arxiv.org/html/2605.19848#bib.bib64)\]. Future work will likely focus on comprehensive, context-aware methods to ensure transparency in high-stakes NLP applications.
Figure 1: The overall framework of our integrated CBM-NLP model with influence functions.
## 3 Methods
This section describes our integrated CBM-NLP framework with influence functions for enhanced interpretability in NLP, including formal definitions. Fig. [1](https://arxiv.org/html/2605.19848#S2.F1) illustrates the overall pipeline.
### 3.1 Concept Bottleneck Model
CBM introduces a concept layer for interpretability, mapping inputs $\mathbf{x}\in\mathbb{R}^{n}$ (e.g., text embeddings) to concepts $\mathbf{c}\in\mathbb{R}^{k}$ ($k$ concepts) via:

$$\mathbf{c}=\phi(\mathbf{x})=\sigma(\mathbf{W}_{1}\mathbf{x}+\mathbf{b}_{1}),\tag{1}$$

where $\mathbf{W}_{1}\in\mathbb{R}^{k\times n}$, $\mathbf{b}_{1}\in\mathbb{R}^{k}$, and $\sigma$ is the activation. Concepts predict output $\hat{y}$ as:

$$\hat{y}=\psi(\mathbf{c})=\mathbf{W}_{2}\mathbf{c}+\mathbf{b}_{2},\tag{2}$$

with $\mathbf{W}_{2}\in\mathbb{R}^{m\times k}$, $\mathbf{b}_{2}\in\mathbb{R}^{m}$ ($m$ classes).
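As a minimal sketch of Eqs. (1) and (2), the forward pass can be written in a few lines of NumPy. The input size `n = 16`, the random weights, and the choice of a sigmoid for $\sigma$ are assumptions made for illustration; only the 13 concepts and 5 classes mirror the CEBaB setting described in the experiments.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cbm_forward(x, W1, b1, W2, b2):
    """Eq. (1)-(2): input x -> concept activations c -> class logits y_hat."""
    c = sigmoid(W1 @ x + b1)   # c in R^k, one activation per concept
    y_hat = W2 @ c + b2        # y_hat in R^m, one logit per class
    return c, y_hat

rng = np.random.default_rng(0)
n, k, m = 16, 13, 5            # n is a hypothetical embedding size
x = rng.normal(size=n)
W1, b1 = rng.normal(size=(k, n)), np.zeros(k)
W2, b2 = rng.normal(size=(m, k)), np.zeros(m)

c, y_hat = cbm_forward(x, W1, b1, W2, b2)

# Intervening on a concept: overwrite one activation and re-run only the head.
c_edit = c.copy()
c_edit[0] = 1.0
y_edit = W2 @ c_edit + b2
```

Because the head is linear, editing concept 0 shifts the logits by exactly `W2[:, 0] * (1.0 - c[0])`, which is what makes concept-level interventions easy to read off.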
### 3.2 Influence Functions
Influence functions quantify training sample impact on predictions. For loss $\mathcal{L}(\theta,(\mathbf{x},y))$, the influence of training sample $(\mathbf{x}_{t},y_{t})$ on test loss is:

$$\mathcal{I}(\mathbf{x}_{t},y_{t},\mathbf{x}_{\text{test}},y_{\text{test}})=-\nabla_{\theta}\mathcal{L}(\theta,(\mathbf{x}_{\text{test}},y_{\text{test}}))^{\top}\mathbf{H}_{\theta}^{-1}\nabla_{\theta}\mathcal{L}(\theta,(\mathbf{x}_{t},y_{t})),\tag{3}$$

where $\mathbf{H}_{\theta}=\nabla_{\theta}^{2}\sum_{i=1}^{N}\mathcal{L}(\theta,(\mathbf{x}_{i},y_{i}))$. We approximate the inverse-Hessian-vector product via conjugate gradient for efficiency.
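To make Eq. (3) concrete, the sketch below computes one influence score for a binary logistic-regression model, solving $\mathbf{H}_{\theta}\mathbf{s}=\nabla_{\theta}\mathcal{L}_{\text{test}}$ with a hand-written conjugate gradient loop so that the Hessian is never formed. The model, the damping term, the toy data, and all function names are illustrative assumptions, and `theta` is an arbitrary vector here rather than a trained optimum.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_loss(theta, x, y):
    """Gradient of the logistic loss for one sample, y in {0, 1}."""
    return (sigmoid(x @ theta) - y) * x

def hessian_vector_product(theta, X, v, damping):
    """H v for H = sum_i grad^2 L_i + damping * I, without forming H."""
    p = sigmoid(X @ theta)
    return X.T @ (p * (1.0 - p) * (X @ v)) + damping * v

def conjugate_gradient(matvec, b, iters=100, tol=1e-20):
    """Solve A s = b for symmetric positive-definite A, given only v -> A v."""
    s = np.zeros_like(b)
    r = b - matvec(s)          # residual
    d = r.copy()               # search direction
    rs = r @ r
    for _ in range(iters):
        if rs < tol:           # squared residual norm small enough
            break
        Ad = matvec(d)
        alpha = rs / (d @ Ad)
        s += alpha * d
        r -= alpha * Ad
        rs_new = r @ r
        d = r + (rs_new / rs) * d
        rs = rs_new
    return s

def influence(theta, X, x_train, y_train, x_test, y_test, damping=1e-2):
    """Eq. (3): -grad L(test)^T H^{-1} grad L(train)."""
    g_test = grad_loss(theta, x_test, y_test)
    s_test = conjugate_gradient(
        lambda v: hessian_vector_product(theta, X, v, damping), g_test)
    return -s_test @ grad_loss(theta, x_train, y_train)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                       # toy training features
y = rng.integers(0, 2, size=50).astype(float)      # toy binary labels
theta = 0.1 * rng.normal(size=5)                   # stand-in for trained parameters
x_test, y_test = rng.normal(size=5), 1.0
score = influence(theta, X, X[0], y[0], x_test, y_test)
```

Solving for `s_test` once per test point and reusing it across all training samples is the usual reason for placing the inverse-Hessian-vector product on the test-gradient side.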
In CBM-NLP, the input text is embedded, features are extracted from the embedding, and these features are mapped to a concept layer in which each neuron represents a linguistic concept; the final prediction is computed from the concept activations.
### 3.3 Sample-Level Influence in CBM-NLP
We adapt influence functions to CBM-NLP for sample-level analysis. For input $\mathbf{z}\in\mathbb{R}^{d}$ and loss $\mathcal{L}(\beta,(\mathbf{z},t))$, influence is:

$$\mathcal{I}(\mathbf{z}_{\text{train}},t_{\text{train}},\mathbf{z}_{\text{test}},t_{\text{test}})=-\nabla_{\beta}\mathcal{L}(\beta,(\mathbf{z}_{\text{test}},t_{\text{test}}))^{\top}\mathbf{Q}_{\beta}^{-1}\nabla_{\beta}\mathcal{L}(\beta,(\mathbf{z}_{\text{train}},t_{\text{train}})),\tag{4}$$

with $\mathbf{Q}_{\beta}=\nabla_{\beta}^{2}\sum_{i=1}^{N}\mathcal{L}(\beta,(\mathbf{z}_{i},t_{i}))$. We compute influences on test samples, rank training samples, and analyze top impactful ones via label modifications to improve performance.
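The ranking step can be sketched as follows: evaluate Eq. (4) for every training sample against one test sample, then sort. The binary logistic model, the synthetic data, and the regularizer `lam` are assumptions for illustration. The sketch also assumes the common sign convention in which a negative score means that up-weighting the sample lowers the test loss (helpful) and a positive score means it raises it (harmful).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
N, d = 200, 6                                   # hypothetical sizes
Z = rng.normal(size=(N, d))                     # training features z_i
beta_true = np.array([1.0, -1.0, 0.5, 0.0, -0.5, 0.25])
t = (rng.uniform(size=N) < sigmoid(Z @ beta_true)).astype(float)  # labels t_i

# Fit beta with Newton steps on the L2-regularised logistic loss.
lam = 1e-2
beta = np.zeros(d)
for _ in range(25):
    p = sigmoid(Z @ beta)
    grad = Z.T @ (p - t) + lam * beta
    Q = Z.T @ (Z * (p * (1 - p))[:, None]) + lam * np.eye(d)
    beta -= np.linalg.solve(Q, grad)

# Hessian Q_beta and per-sample gradients at the fitted parameters.
p = sigmoid(Z @ beta)
Q = Z.T @ (Z * (p * (1 - p))[:, None]) + lam * np.eye(d)
G_train = (p - t)[:, None] * Z                  # row i is grad L(beta, (z_i, t_i))

z_test, t_test = rng.normal(size=d), 1.0
g_test = (sigmoid(z_test @ beta) - t_test) * z_test

# Eq. (4) for all training samples at once: -g_test^T Q^{-1} g_i.
scores = -G_train @ np.linalg.solve(Q, g_test)

order = np.argsort(scores)
top_helpful = order[:3]          # most negative scores
top_harmful = order[::-1][:3]    # most positive scores
```

For models where `Q` is too large to factorize, the same scores can be obtained by replacing `np.linalg.solve` with an iterative inverse-Hessian-vector product.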
### 3.4 Concept-Level Influence in CBM-NLP
For concept-level analysis, we track concept impacts. For concept vector $\mathbf{v}\in\mathbb{R}^{k}$, influence of $v_{j}$ on $\hat{t}$ is:

$$\mathcal{I}_{\text{concept}}(v_{j},(\mathbf{z},t))=\frac{\partial\hat{t}}{\partial v_{j}}\nabla_{\beta}\mathcal{L}(\beta,(\mathbf{z},t))^{\top}\mathbf{Q}_{\beta}^{-1}\nabla_{\beta}\mathcal{L}(\beta,(\mathbf{z},t)).\tag{5}$$

We introduce anomalies at bottlenecks to observe changes in harmful concepts, enhancing interpretability.
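A small sketch can show how Eq. (5) behaves in the simplest case. It assumes a scalar logit $\hat{t}=\mathbf{w}^{\top}\mathbf{v}+b$ with a linear head, so that $\partial\hat{t}/\partial v_{j}=w_{j}$, takes $\beta$ to be the head weights `w`, and uses a logistic loss with a small damping term. These choices, the random data, and the function name `concept_influence` are hypothetical and are not taken from the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
N, k = 100, 13                           # 13 concepts as in CEBaB; N is hypothetical
V = sigmoid(rng.normal(size=(N, k)))     # concept activations of the training set
w = rng.normal(size=k)                   # head weights, assumed already trained

def concept_influence(v, t_true, w, V, damping=1e-2):
    """Eq. (5) for every concept j of one sample (v, t_true)."""
    p_all = sigmoid(V @ w)
    Q = V.T @ (V * (p_all * (1 - p_all))[:, None]) + damping * np.eye(len(w))
    g = (sigmoid(v @ w) - t_true) * v        # grad_beta L(beta, (z, t))
    self_term = g @ np.linalg.solve(Q, g)    # grad^T Q^{-1} grad, non-negative
    dt_dv = w                                # d t_hat / d v_j for a linear head
    return dt_dv * self_term

scores = concept_influence(V[0], 1.0, w, V)
most_positive = int(np.argmax(scores))
most_negative = int(np.argmin(scores))
```

Under these assumptions the quadratic form is shared by all concepts of a sample, so the per-sample ranking of concepts follows the head weights, while the magnitude is scaled by how strongly that sample influences itself.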
### 3.5 Integration of Influence Functions with CBM-NLP
Integration traces sample effects through concepts to predictions. The combined loss is:

$$\mathcal{L}_{\text{combined}}(\beta,(\mathbf{z},t))=\mathcal{L}(\beta,(\mathbf{z},t))+\gamma\sum_{j=1}^{k}\mathcal{I}_{\text{concept}}(v_{j},(\mathbf{z},t)),\tag{6}$$

where $\gamma$ balances regularization, minimizing harmful influences.
## 4 Experiments
### 4.1 Datasets and Evaluation
To validate the proposed CBM-NLP framework integrated with influence functions, we conducted experiments on CEBaB \[[32](https://arxiv.org/html/2605.19848#bib.bib29)\], a representative NLP sentiment analysis dataset. It is tailored for interpretable sentiment analysis and contains over 5,000 human-annotated text samples with sentiment ratings from 1 to 5, which map to 5 sentiment labels, and 13 concepts. The dataset is split into a training set of 1,755 samples, a validation set of 1,673 samples, and a test set of 1,685 samples.
We employ stochastic approximation techniques to efficiently compute influence functions, ensuring scalability to the full dataset\. The evaluation metrics include accuracy, F1 score, and interpretability measures, which are assessed before and after the application of influence\-based adjustments\.
### 4.2 Experimental Setup
We select five mainstream pre-trained language models (PLMs) as the backbone of the CBM-NLP framework, considering their compatibility with concept bottleneck integration and proven performance in NLP tasks: GPT-2 \[[22](https://arxiv.org/html/2605.19848#bib.bib1)\], BERT \[[8](https://arxiv.org/html/2605.19848#bib.bib2)\], RoBERTa \[[7](https://arxiv.org/html/2605.19848#bib.bib3)\], Qwen2.5-3B-Instruct \[[29](https://arxiv.org/html/2605.19848#bib.bib66)\]\[[30](https://arxiv.org/html/2605.19848#bib.bib65)\], and Llama3.2-3B \[[17](https://arxiv.org/html/2605.19848#bib.bib67)\]. For the smaller models (GPT-2, BERT, and RoBERTa), the training configuration is kept consistent to ensure a fair comparison: a batch size of 64, a learning rate of 1e-5, and a maximum sequence length of 256. For the LLMs (Qwen2.5-3B-Instruct and Llama3.2-3B), all parameters follow their official default settings. The concept-layer loss (XtoC loss) is given a weight of 0.5 to balance concept inference and final prediction performance. All experiments are conducted on a server equipped with 4 NVIDIA A100 GPUs.
The experiment includes three key designs: (1) Baseline: we replicate the original CBM model and run it on the CEBaB dataset to establish a baseline state, which serves as the reference for evaluating performance changes after introducing the influence function. (2) Label modification: we randomly select 100 samples, assign each an obviously incorrect label, and compare the model’s accuracy before and after the modification. (3) Weight reset: we reset the weights of the samples with corrupted labels to zero and test whether this adjustment mitigates the negative impact of the misleading samples and restores model performance, further verifying the effectiveness of the influence function for data debugging.
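The three states can be illustrated on a toy problem. The text does not spell out how the zero-weight state is realized without retraining; the sketch below assumes one common option, a single first-order influence update in the style of Koh and Liang \[14\], which adds $\mathbf{Q}_{\beta}^{-1}$ times the summed gradients of the removed samples to the current parameters. The binary logistic model, the synthetic data, the label flip standing in for an "obviously incorrect label", and all names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit(Z, t, weights, lam=1.0, steps=30):
    """Weighted L2-regularised logistic regression via Newton's method."""
    beta = np.zeros(Z.shape[1])
    for _ in range(steps):
        p = sigmoid(Z @ beta)
        grad = Z.T @ (weights * (p - t)) + lam * beta
        Q = Z.T @ (Z * (weights * p * (1 - p))[:, None]) + lam * np.eye(Z.shape[1])
        beta -= np.linalg.solve(Q, grad)
    return beta

def accuracy(beta, Z, t):
    return float(np.mean((sigmoid(Z @ beta) > 0.5) == (t > 0.5)))

rng = np.random.default_rng(0)
beta_true = np.array([1.0, -1.0, 0.5, 0.0, -0.5])
Z = rng.normal(size=(1000, 5))
t = (rng.uniform(size=1000) < sigmoid(Z @ beta_true)).astype(float)
Z_test = rng.normal(size=(500, 5))
t_test = (rng.uniform(size=500) < sigmoid(Z_test @ beta_true)).astype(float)
ones = np.ones(1000)

# State 1 (baseline): train on clean labels.
beta_base = fit(Z, t, ones)

# State 2 (modified): flip the labels of 100 randomly chosen samples, then train.
bad = rng.choice(1000, size=100, replace=False)
t_bad = t.copy()
t_bad[bad] = 1.0 - t_bad[bad]
beta_mod = fit(Z, t_bad, ones)

# State 3 (reset weight): approximate zero weight on the corrupted samples
# without refitting, via beta_mod + Q^{-1} * (sum of their gradients).
p = sigmoid(Z @ beta_mod)
Q = Z.T @ (Z * (p * (1 - p))[:, None]) + 1.0 * np.eye(5)
g_bad = Z[bad].T @ (p[bad] - t_bad[bad])
beta_reset = beta_mod + np.linalg.solve(Q, g_bad)

# Reference only: full retraining with the corrupted samples weighted to zero.
w0 = ones.copy()
w0[bad] = 0.0
beta_retrain = fit(Z, t_bad, w0)

for name, b in [("baseline", beta_base), ("modified", beta_mod), ("reset", beta_reset)]:
    print(name, round(accuracy(b, Z_test, t_test), 3))
```

Comparing `beta_reset` with `beta_retrain` gives a direct check of how well the one-step update stands in for retraining on a given problem.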
### 4.3 Main Results
Table 1: Performance comparison on the CEBaB dataset. Results are reported for baseline, modified, and reset-weight configurations across four evaluation metrics.

From Table [1](https://arxiv.org/html/2605.19848#S4.T1), sample label corruption (Modified state) causes consistent performance degradation across all models: concept-level metrics decrease by 1.42–2.52%, and overall prediction metrics drop by 1.90–2.50%. The sharp decline in influence values confirms that label corruption turns originally impactful samples into misleading signals, which conflict with the model’s learned semantic patterns and disrupt both concept inference and final sentiment prediction. Model sensitivity to corruption varies slightly: BERT shows the largest drop in overall accuracy, while GPT-2 remains relatively robust. This difference may relate to BERT’s bidirectional context dependency, which could make it more sensitive to noisy samples than GPT-2’s autoregressive architecture.
Resetting the weight of corrupted samples \(Reset Weight state\) restores model performance to near\-baseline levels for all backbones\. This result indicates that the influence function can accurately identify misleading samples, which are the root cause of performance degradation\. Unlike traditional debugging methods that require time\-consuming comprehensive retraining, this framework utilizes the influence function to identify and eliminate harmful samples, achieving efficient performance recovery without the need for retraining\.
## 5 Analysis
### 5.1 Evaluation on the Yelp Dataset
To further validate the generalizability of the aforementioned experimental conclusions, we conducted a supplementary experiment on the Yelp dataset. The Yelp dataset \[[35](https://arxiv.org/html/2605.19848#bib.bib30)\] contains 3,000 restaurant reviews. This dataset has a similar structure to the CEBaB dataset, featuring ratings on a scale of 1-5 stars. It is split into a training set of 2,000 samples, a validation set of 500 samples, and a test set of 500 samples. We use the two large models, Qwen2.5-3B-Instruct and Llama3.2-3B, focus on the same three states, and evaluate only the core prediction metrics: Test Accuracy and Test Macro F1. The results, as shown in Table [2](https://arxiv.org/html/2605.19848#S5.T2), confirm that the performance variation pattern of the models on the Yelp dataset is consistent with that on the CEBaB dataset: label corruption leads to significant performance degradation, while resetting the weight of corrupted samples restores performance to near-baseline levels. This consistency indicates that the proposed framework’s effectiveness is not limited to a single dataset but can be generalized to different sentiment analysis scenarios.
Table 2: Evaluation results of Qwen2.5-3B-Instruct and Llama3.2-3B on the Yelp dataset.

Table 3: Sample- and concept-level influence values before and after label modification for the CEBaB and Yelp datasets. The sample-level results occupy the middle six rows, showing three harmful and three helpful samples as cases. The concept-level results occupy the last two rows, showing the most harmful and the most helpful concept as cases.
### 5.2 Sample- and Concept-Level Influence Analysis
To explore how training samples and bottleneck\-layer concepts in the CBM framework affect model predictions, and to validate the effectiveness of influence functions for data debugging, this study conducted experiments on the CEBaB and Yelp datasets with Llama3\.2\-3B as the backbone network\. The core experimental process included three key steps: first, calculating influence values for all training samples and concepts; second, ranking these values to identify the top 3 harmful/helpful samples and the most harmful/helpful concepts; and finally, performing modification experiments \(label adjustment for samples, concept data alteration for concepts\) to observe changes in influence contributions\.
Sample-Level Influence Analysis: As shown in Table [3](https://arxiv.org/html/2605.19848#S5.T3), label modification consistently alters sample contributions across both datasets. Harmful samples exhibit reduced negative impact after modification, while helpful samples show weakened positive contribution. This pattern confirms that influence functions can effectively pinpoint high-impact samples (whether harmful or helpful) and validates that label modification is a feasible strategy for debugging noisy data in model training.

Concept-Level Influence Analysis: Table [3](https://arxiv.org/html/2605.19848#S5.T3) also reveals consistent trends in concept-level results across datasets. Disrupting the most harmful concept reduces its negative interference with predictions, while enhancing the most helpful concept boosts its positive contribution to model decisions. These findings demonstrate that influence functions are capable of tracking concept-level impacts accurately, and targeted adjustments to key concepts (mitigating harmful ones, strengthening helpful ones) directly improve the model’s interpretability and prediction stability.
### 5.3 Sample Similarity Analysis via Influence Functions
To verify whether influence functions can identify semantically similar samples, we selected 3 test samples covering service quality, food taste, and ambient atmosphere themes. For each, we calculated influence values against all training samples and identified the most influential one. Table [4](https://arxiv.org/html/2605.19848#S5.T4) presents the results.
In all three groups, the most influential training sample shares core semantics and sentiment with the test sample:
- Service Quality: Shared service praise and revisit intention; dominant sentiment outweighs minor complaints.
- Food Taste: Shared dissatisfaction with main dishes (salty/overcooked).
- Ambient Atmosphere: Shared complaints about noise and poor lighting.

Influence functions effectively capture semantic similarity between samples, providing a traceable path for model decisions and enhancing transparency in sentiment analysis.

Table 4: Three-group case study: Influence-based sample similarity analysis on CEBaB dataset
## 6 Conclusion
This study integrates influence functions into the CBM\-NLP framework to enhance NLP model interpretability\. Experiments on CEBaB and Yelp datasets show that influence functions effectively identify and adjust key samples/concepts, revealing the model’s decision\-making process at both data and concept levels\.
The experimental results confirm that integrating influence functions with CBM\-NLP achieves a balance between predictive performance and interpretability, offering an efficient, retraining\-free debugging solution for NLP models\. Future work will explore applying the framework to larger NLP datasets and extending it to sequence\-labeling tasks for broader applicability\.
## Acknowledgements
This work was supported in part by the Young Scientists Fund of the National Natural Science Foundation of China \(NSFC\) under Grant 62506084, in part by the Science and Technology Development Fund of Macau SAR \(Grant Nos\. FDCT/0007/2024/AKP, EF202400185\-FST\), the UM and UMDF \(Grant Nos\. MYRG\-GRG2024\-00165\-FST\-UMDF, MYRGGRG2025\-00236\-FST, SHMDF\-AI/2026/001\)\.
## References
- \[1\] E. Barshan, S. Prince, and P. V. Benos (2020) Relatif: an interpretable method for recommending influential data in deep learning. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, pp. 1539–1547.
- \[2\] S. Basu, E. J. Christensen, E. Raff, and P. Barford (2020) Influence functions in deep learning are fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3645–3652.
- \[3\] S. Basu, E. Raff, and P. Barford (2023) Reliable influence functions for deep neural networks. In Proceedings of the 40th International Conference on Machine Learning, pp. 1453–1465.
- \[4\] K. Chauhan, R. Tiwari, J. Freyberg, P. Shenoy, and K. Dvijotham (2023) Interactive concept bottleneck models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37(5), pp. 5948–5955.
- \[5\] J. Chen, M. Song, and Z. Weng (2020) Concept whitening for interpretable image recognition. In International Conference on Machine Learning, pp. 1626–1635.
- \[6\] K. Clark, U. Khandelwal, O. Levy, and C. D. Manning (2019) What does bert look at? an analysis of bert’s attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 276–286.
- \[7\] P. Delobelle, T. Winters, and B. Berendt (2020-11) RobBERT: a Dutch RoBERTa-based Language Model. In Findings of the Association for Computational Linguistics: EMNLP 2020, T. Cohn, Y. He, and Y. Liu (Eds.), Online, pp. 3255–3265. External Links: [Link](https://aclanthology.org/2020.findings-emnlp.292/), [Document](https://dx.doi.org/10.18653/v1/2020.findings-emnlp.292)
- \[8\] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019-06) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), Minneapolis, Minnesota, pp. 4171–4186. External Links: [Link](https://aclanthology.org/N19-1423/), [Document](https://dx.doi.org/10.18653/v1/N19-1423)
- \[9\] F. Doshi-Velez and B. Kim (2017) Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
- \[10\] M. Havasi, S. Parbhoo, and F. Doshi-Velez (2022) Addressing leakage in concept bottleneck models. Advances in Neural Information Processing Systems 35, pp. 23386–23397.
- \[11\] S. Jain and B. C. Wallace (2019) Attention is not explanation. arXiv preprint arXiv:1902.10186.
- \[12\] B. Kim, M. Wattenberg, and J. Gilmer (2018) Interpretability beyond feature attribution: quantitative testing with concept activation vectors (tcav). arXiv preprint arXiv:1711.11279.
- \[13\] E. Kim, D. Jung, S. Park, S. Kim, and S. Yoon (2023) Probabilistic concept bottleneck models. arXiv preprint arXiv:2306.01574.
- \[14\] P. W. Koh and P. Liang (2017) Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, pp. 1885–1894.
- \[15\] P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann, E. Pierson, B. Kim, and P. Liang (2020) Concept bottleneck models. In International Conference on Machine Learning, pp. 5338–5347.
- \[16\] S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4765–4774.
- \[17\] MetaAI (2024) The llama 3 herd of models. External Links: 2407.21783, [Link](https://arxiv.org/abs/2407.21783)
- \[18\] C. Molnar (2020) Interpretable machine learning. Lulu.com.
- \[19\] T. Oikarinen, S. Das, L. M. Nguyen, and T. Weng (2022) Label-free concept bottleneck models. In The Eleventh International Conference on Learning Representations.
- \[20\] L. Peruzzo, G. Schioppa, and M. Müller (2023) Layer-wise influence functions for analyzing large language models. Transactions of the Association for Computational Linguistics 11, pp. 87–103.
- \[21\] M. Pezeshki, Q. Guo, J. Huang, T. Laurent, G. Peyré, and S. Lacoste-Julien (2021) Gradient-based data subset selection for efficient deep learning. Advances in Neural Information Processing Systems 34, pp. 10064–10075.
- \[22\] A. Radford et al. (2019) Language models are unsupervised multitask learners. OpenAI Blog.
- \[23\] M. T. Ribeiro, S. Singh, and C. Guestrin (2016) Why should i trust you?: explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144.
- \[24\] C. Rudin (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1(5), pp. 206–215.
- \[25\] G. Schioppa, L. Peruzzo, and M. Müller (2022) Efficient estimation of influence functions in deep learning. Journal of Machine Learning Research 23, pp. 1–29.
- \[26\]I\. Stepin, J\. M\. S\. Alonso, M\. Pereira\-Fariña, and H\. Alani\(2021\)A survey of contrastive and counterfactual explanation generation methods for explainable artificial intelligence\.IEEE Access9,pp\. 11974–12001\.Cited by:[§2\.3](https://arxiv.org/html/2605.19848#S2.SS3.p1.1)\.
- \[27\]Z\. Tan, T\. Chen, Z\. Zhang, and H\. Liu\(2023\)Sparsity\-guided holistic explanation for llms with interpretable inference\-time intervention\.arXiv preprint arXiv:2312\.15033\.Cited by:[§1](https://arxiv.org/html/2605.19848#S1.p2.1),[§2\.1](https://arxiv.org/html/2605.19848#S2.SS1.p1.1)\.
- \[28\]Z\. Tan, L\. Cheng, S\. Wang, Y\. Bo, J\. Li, and H\. Liu\(2023\)Interpreting pretrained language models via concept bottlenecks\.arXiv preprint arXiv:2311\.05014\.Cited by:[§2\.1](https://arxiv.org/html/2605.19848#S2.SS1.p1.1)\.
- \[29\]Q\. Team\(2024\)Qwen2 technical report\.arXiv preprint arXiv:2407\.10671\.Cited by:[§4\.2](https://arxiv.org/html/2605.19848#S4.SS2.p1.1)\.
- \[30\]Q\. Team\(2024\-09\)Qwen2\.5: a party of foundation models\.External Links:[Link](https://qwenlm.github.io/blog/qwen2.5/)Cited by:[§4\.2](https://arxiv.org/html/2605.19848#S4.SS2.p1.1)\.
- \[31\]S\. Wachter, B\. Mittelstadt, and C\. Russell\(2017\)Counterfactual explanations without opening the black box: automated decisions and the gdpr\.InHarvard Journal of Law & Technology,Vol\.31,pp\. 841–887\.Cited by:[§2\.3](https://arxiv.org/html/2605.19848#S2.SS3.p1.1)\.
- \[32\]J\. Wang, M\. Li, J\. Zhang, and Y\. Liu\(2021\)CEBAB: a large\-scale annotated corpus for chinese sentiment analysis\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,pp\. 1234–1244\.Cited by:[§4\.1](https://arxiv.org/html/2605.19848#S4.SS1.p1.1)\.
- \[33\]S\. Wiegreffe and Y\. Pinter\(2019\)Attention is not not explanation\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing,pp\. 11–20\.Cited by:[§2\.3](https://arxiv.org/html/2605.19848#S2.SS3.p1.1)\.
- \[34\]C\. Yeh, B\. Kim, S\. K\. Arik, C\. Li, T\. Pfister, and P\. Ravikumar\(2020\)Completeness\-aware concept\-based explanations in deep neural networks\.Advances in Neural Information Processing Systems33,pp\. 20554–20565\.Cited by:[§1](https://arxiv.org/html/2605.19848#S1.p3.1)\.
- \[35\]Yelp\(2021\)Yelp open dataset: restaurant reviews and ratings\.InProceedings of the 2021 International Conference on Data Mining,pp\. 1567–1571\.Cited by:[§5\.1](https://arxiv.org/html/2605.19848#S5.SS1.p1.1)\.
- \[36\] M\. Yuksekgonul, M\. Wang, and J\. Zou \(2022\) Post\-hoc concept bottleneck models\. In The Eleventh International Conference on Learning Representations\. Cited by: [§1](https://arxiv.org/html/2605.19848#S1.p2.1), [§2\.1](https://arxiv.org/html/2605.19848#S2.SS1.p1.1)\.