From Scoring to Explanations: Evaluating SHAP and LLM Rationales for Rubric-based Teaching Quality Assessment

Summary

This paper proposes a framework for sentence-level interpretability of rubric-based scoring, comparing SHAP and LLM-generated rationales. It finds that fine-tuned pretrained language models outperform LLMs in prediction accuracy, and SHAP provides more faithful and transferable explanations.

arXiv:2606.05180v1 Announce Type: new Abstract: Automated scoring models are increasingly used to assign rubric-based quality ratings to complex language performances, including classroom transcripts, yet they typically provide little insight into why a particular score is produced. We propose a general framework for sentence-level interpretability of rubric-based scoring that combines model-agnostic Shapley-value attributions with rationales generated by large language models (LLMs). Instantiated on the Quality of Feedback dimension of the CLASS framework using the NCTE corpus, the framework enables systematic comparison of fine-tuned pretrained language models (PLMs) and prompted LLMs on both scoring performance and explanation faithfulness. Across 6k annotated transcript segments, fine-tuned PLMs outperform LLMs in prediction accuracy but exhibit label compression toward mid-scale scores. Deletion-based tests show that SHAP identifies sentences that reliably drive model predictions, producing typically larger and more coherent prediction shifts than LLM-generated rationales. Cross-model analyses further reveal that SHAP attributions transfer robustly across architectures, whereas LLM rationales exert limited and inconsistent influence. Overall, the findings demonstrate that SHAP provides more faithful and transferable explanations for rubric-based scoring, and that the proposed framework offers a principled basis for evaluating both scoring models and their explanations in high-stakes educational settings and other rubric-based language assessment tasks.

Cached at: 06/05/26, 08:05 AM

# Evaluating SHAP and LLM Rationales for Rubric-based Teaching Quality Assessment
Source: [https://arxiv.org/html/2606.05180](https://arxiv.org/html/2606.05180)
## From Scoring to Explanations: Evaluating SHAP and LLM Rationales for Rubric\-based Teaching Quality Assessment

Ivo Bueno \(1, 2\), Babette Bühler \(1, 2\), Philipp Stark \(3\), Tim Fütterer \(4\), Ulrich Trautwein \(4\), Dorottya Demszky \(5\), Heather Hill \(6\), Enkelejda Kasneci \(1, 2\)

\(1\) Technical University of Munich; \(2\) Munich Center for Machine Learning \(MCML\); \(3\) Lund University; \(4\) University of Tübingen; \(5\) Stanford Graduate School of Education; \(6\) Harvard Graduate School of Education

Correspondence: [ivo\.bueno@tum\.de](mailto:ivo.bueno@tum.de)

###### Abstract

Automated scoring models are increasingly used to assign rubric\-based quality ratings to complex language performances, including classroom transcripts, yet they typically provide little insight into why a particular score is produced\. We propose a general framework for sentence\-level interpretability of rubric\-based scoring that combines model\-agnostic Shapley value attributions with rationales generated by large language models \(LLMs\)\. Instantiated on the Quality of Feedback dimension of the CLASS framework using the NCTE corpus, the framework enables systematic comparison of fine\-tuned pretrained language models \(PLMs\) and prompted LLMs on both scoring performance and explanation faithfulness\. Across 6k annotated transcript segments, fine\-tuned PLMs outperform LLMs in prediction accuracy but exhibit label compression toward mid\-scale scores\. Deletion\-based tests show that SHAP identifies sentences that reliably drive model predictions, typically producing larger and more coherent prediction shifts than LLM\-generated rationales\. Cross\-model analyses further reveal that SHAP attributions transfer robustly across architectures, whereas LLM rationales exert limited and inconsistent influence\. Overall, the findings demonstrate that SHAP provides more faithful and transferable explanations for rubric\-based scoring, and that the proposed framework offers a principled basis for evaluating both scoring models and their explanations in high\-stakes educational settings and other rubric\-based language assessment tasks\.

## 1 Introduction

Rubric\-based scoring models are increasingly used to automatically evaluate open\-ended language tasks, from student essays and peer feedback to clinical notes and classroom transcripts\. In these settings, models assign scalar scores on multi\-level rubrics that inform teaching, evaluation, and policy decisions, yet most systems provide little insight into why a particular score was assigned\. This is especially problematic in high\-stakes educational contexts, where stakeholders such as teachers must be able to understand, trust, and contest automated judgments, requirements that are now explicitly reflected in emerging regulatory frameworks such as the EU AI Act \(European Parliament and Council of the European Union, [2024](https://arxiv.org/html/2606.05180#bib.bib31)\)\. Therefore, a central challenge emerges: *how can we trust rubric\-based scores produced by opaque, black\-box models such as large language models \(LLMs\) when their internal decision\-making is inaccessible and their explanations may be unfaithful?* Recent work suggests that free\-form explanations produced by LLMs can be persuasive without faithfully reflecting the underlying computation \(Turpin et al\., [2023](https://arxiv.org/html/2606.05180#bib.bib25); Ye and Durrett, [2022](https://arxiv.org/html/2606.05180#bib.bib26)\), thus raising a critical question for explainable NLP: *which parts of a text truly drive a model’s rubric\-based score, and how can we evaluate whether an explanation has captured them?*

High\-quality feedback is central to teachers’ professional growth, yet providing consistent and individualized feedback is resource\-intensive and prone to inconsistency\. Recent work demonstrates that automated feedback tools can enhance teachers’ uptake of student ideas by as much as 24% \(Demszky et al\., [2024](https://arxiv.org/html/2606.05180#bib.bib1)\), showing the promise of NLP\-based approaches for supporting teacher development\. Classroom teaching quality is thus a prototypical example of a high\-stakes, rubric\-based judgment where opaque scores are insufficient, and explanations are critical\.

Automated scoring of teaching quality dimensions has likewise proven feasible \(Hou et al\., [2024](https://arxiv.org/html/2606.05180#bib.bib3); Fütterer et al\., [2026](https://arxiv.org/html/2606.05180#bib.bib32)\), but scoring alone does not provide insight into the reasoning behind a model’s evaluation\. Moving from *what* \(the score\) to *why* \(the reasoning\) is essential for generating actionable feedback and fostering user trust\. However, whereas LLMs can generate rich, sentence\-level rationales, a growing body of work shows that such explanations often fail to reflect the model’s actual decision process \(Turpin et al\., [2023](https://arxiv.org/html/2606.05180#bib.bib25); Ye and Durrett, [2022](https://arxiv.org/html/2606.05180#bib.bib26)\)\. Despite the rapid progress of LLMs and transformer\-based scoring systems, their decision\-making processes often remain opaque\.

To bridge this gap, we investigate explainable NLP methods that can reveal which parts of classroom dialogue most strongly influence automated teaching quality assessments\. Specifically, we propose a unified framework for sentence\-level interpretability of rubric\-based teaching quality scores, comparing model\-agnostic feature attribution using SHAP \(Lundberg and Lee, [2017](https://arxiv.org/html/2606.05180#bib.bib7)\) with LLM\-based reasoning to identify aspects of teacher–student interaction that contribute to high\- or low\-quality feedback\. Our study focuses on the *Quality of Feedback* dimension, which captures the quality of the feedback teachers give their students in the classroom and belongs to the *Instructional Support* domain of the Classroom Assessment Scoring System \(CLASS\) framework \(Pianta et al\., [2008](https://arxiv.org/html/2606.05180#bib.bib8)\)\. We chose this dimension because providing feedback is a core teaching practice and because its labels are comparatively balanced in our dataset\.

We evaluate fine\-tuned transformer\-based models and LLMs on the NCTE dataset \(Demszky and Hill, [2023](https://arxiv.org/html/2606.05180#bib.bib9)\), containing elementary mathematics classroom transcripts annotated by expert observers\. Beyond comparing model performance, our work examines the faithfulness and consistency of different explanation methods by systematically removing sentences highlighted as important, either by SHAP or by LLM\-generated rationales, and measuring how these removals change predictions\. We further introduce a cross\-model evaluation protocol where explanations generated for one model family are used to perturb inputs for the other, allowing us to study whether explanations transfer across architectures or remain model\-specific\.

Our work contributes to the growing body of research on explainable NLP and reliable LLM rationales in education and other rubric\-based assessment settings by:

1. Proposing a general framework for sentence\-level interpretability of rubric\-based scoring models that combines model\-agnostic Shapley value attributions with LLM\-generated rationales, instantiated on automated teaching quality assessment\.
2. Comparing specialized fine\-tuned models and LLM prompting for teaching quality scoring\.
3. Evaluating the faithfulness of SHAP and LLM\-based explanations through deletion\-based tests, assessing how influential the identified sentences truly are for model predictions\.
4. Introducing cross\-model consistency analyses, where LLM\-selected sentences are removed and evaluated with fine\-tuned models \(and vice versa\), to probe the alignment of explanation methods across architectures\.
5. Discussing the implications of our findings for the design of transparent, actionable teacher feedback tools and, more broadly, for the use of LLM rationales as explanations in rubric\-based educational assessments\.

We address the following research questions \(RQs\):

- RQ1: How do fine\-tuned transformer\-based pretrained language models \(PLMs\) compare to prompted LLMs in predicting rubric\-based teaching quality scores on the Quality of Feedback dimension?
- RQ2: How faithful and reliable are SHAP\- and LLM\-based sentence\-level explanations in identifying influential parts of a classroom transcript, as measured by deletion\-based changes in predictions?
- RQ3: To what extent do explanations transfer across model types, i\.e\., does removing sentences identified by one model meaningfully affect the predictions of another?

## 2 Related Work

### 2\.1 Explainability Methods for NLP Models

Explainable NLP methods aim to identify input components that most strongly influence model predictions, a crucial requirement in educational contexts where transparency is essential\. Model\-agnostic approaches, e\.g\., LIME \(Ribeiro et al\., [2016](https://arxiv.org/html/2606.05180#bib.bib22)\) and SHAP \(Lundberg and Lee, [2017](https://arxiv.org/html/2606.05180#bib.bib7)\), provide local feature attributions by approximating complex classifiers or computing Shapley values\. Beyond NLP, SHAP has also been applied in other domains to improve model interpretability, including work that combines SHAP explanations with LLM\-generated descriptions to enhance human\-understandable rationales \(Khediri et al\., [2024](https://arxiv.org/html/2606.05180#bib.bib30)\)\. In our work, we use SHAP at a sentence embedding level within a hierarchical PLM architecture, treating sentences as features to obtain document\-level attributions that are both computationally tractable and directly actionable for feedback\.

For neural text models, attention\-based interpretations have been debated due to concerns about whether attention weights reflect causal importance \(Jain and Wallace, [2019](https://arxiv.org/html/2606.05180#bib.bib23); Wiegreffe and Pinter, [2019](https://arxiv.org/html/2606.05180#bib.bib24)\)\. To address these limitations, research increasingly distinguishes plausibility from faithfulness \(Jacovi and Goldberg, [2020](https://arxiv.org/html/2606.05180#bib.bib29)\)\. Deletion\- and perturbation\-based evaluation, i\.e\., removing influential input elements and observing prediction changes, provides a more direct measure of explanation faithfulness \(DeYoung et al\., [2020](https://arxiv.org/html/2606.05180#bib.bib28)\)\. We adopt this perspective and extend it to a cross\-model setting: explanations are evaluated not only with respect to their source model, but also by measuring how they perturb predictions of alternative architectures\. Furthermore, our work contributes by applying sentence\-level SHAP in a hierarchical PLM setting and evaluating its faithfulness through systematic deletion tests\.

Recent work in educational assessment has also explored the use of SHAP to interpret rubric\-based scoring models\. For example, Boulanger and Kumar \([2020](https://arxiv.org/html/2606.05180#bib.bib39)\) apply SHAP to automated essay scoring to quantify the contribution of linguistic features to rubric\-level predictions, enabling both local and global interpretability of model behavior\. Similarly, Kumar and Boulanger \([2020](https://arxiv.org/html/2606.05180#bib.bib40)\) demonstrate that combining deep learning with SHAP can expose the decision\-making process of rubric\-based scoring systems and support the generation of fine\-grained, formative feedback aligned with pedagogical criteria\. These findings highlight the potential of SHAP not only as a diagnostic tool for model transparency, but also as a bridge between model predictions and human\-interpretable rubric constructs in educational settings\.

### 2\.2 Faithfulness and Reliability of Model Explanations in LLMs

LLMs are increasingly producing free\-form rationales and structured justifications for their predictions\. However, a growing body of work suggests that these explanations may not accurately reflect the underlying computation\. Chain\-of\-thought rationales can be unfaithful even while producing correct answers \(Turpin et al\., [2023](https://arxiv.org/html/2606.05180#bib.bib25)\), and explanations in few\-shot prompts frequently exhibit inconsistencies or hallucinations \(Ye and Durrett, [2022](https://arxiv.org/html/2606.05180#bib.bib26)\)\. Structured prompting can improve reliability, but challenges remain \(Ayala and Bechard, [2024](https://arxiv.org/html/2606.05180#bib.bib27)\)\.

Recent approaches have proposed adapting Shapley\-based methods to LLMs \(Mohammadi, [2024](https://arxiv.org/html/2606.05180#bib.bib21)\), although computational constraints limit their practical use\. Despite the increased adoption of LLMs in educational settings, the faithfulness of their rationales has not been systematically evaluated relative to established attribution methods such as SHAP, especially on long, naturalistic transcripts and multi\-level rubrics\. Our work addresses this gap by comparing LLM\-generated sentence rankings against PLM\-based SHAP attributions using matched deletion\-based faithfulness tests and cross\-model robustness analysis, thereby providing empirical evidence on when LLM rationales align with model behavior and when they diverge\.

### 2\.3 Automated Scoring and Teacher Feedback in Educational Settings

Recent efforts have explored automated approaches to analyze classroom instruction and support teacher learning\. The NCTE dataset \(Demszky and Hill, [2023](https://arxiv.org/html/2606.05180#bib.bib9)\) has facilitated large\-scale research on evaluating teacher uptake of student ideas \(Demszky et al\., [2024](https://arxiv.org/html/2606.05180#bib.bib1)\), and on leveraging models such as ChatGPT for instructional scoring and feedback \(Wang and Demszky, [2023](https://arxiv.org/html/2606.05180#bib.bib16)\)\. More broadly, researchers have begun to validate automated assessments of teaching quality dimensions using multimodal approaches, including audio, video, and text features, combining embeddings with LLM\-generated scores \(Hou et al\., [2024](https://arxiv.org/html/2606.05180#bib.bib3), [2025a](https://arxiv.org/html/2606.05180#bib.bib2); Fütterer et al\., [2026](https://arxiv.org/html/2606.05180#bib.bib32)\)\. In particular, Hou et al\. \([2025b](https://arxiv.org/html/2606.05180#bib.bib33)\) evaluate LLM\-based multimodal models for classroom assessment by comparing their predictions against human annotations, providing evidence that such models can approximate human judgments of teaching quality\.

In parallel, NLP has long been applied to educational assessment tasks, including essay scoring and discussion analysis\. Recent work shows that LLMs and PLMs can approximate human rubric\-based judgments across multiple writing dimensions \(Seßler et al\., [2025](https://arxiv.org/html/2606.05180#bib.bib17)\), while neural models have been used to assess the quality of classroom discussions or participation \(Tran et al\., [2023](https://arxiv.org/html/2606.05180#bib.bib19)\)\. LLMs are increasingly used to provide pedagogically aligned feedback to learners \(Meyer et al\., [2024](https://arxiv.org/html/2606.05180#bib.bib20)\)\. Our work situates itself within this line of research, but shifts the focus from *predictive performance* to *explanation quality*\. Specifically, we compare fine\-tuned PLMs and instruction\-tuned LLMs for scoring a specific CLASS dimension \(i\.e\., Quality of Feedback\) and, more importantly, introduce a general framework for evaluating sentence\-level explanations for rubric\-based scores\.

## 3 Methods

We cast teaching quality assessment as a general rubric\-based text scoring problem with model\-agnostic sentence\-level explanations and instantiate this framework using both PLMs and instruction\-tuned LLMs\.

![Refer to caption](https://arxiv.org/html/2606.05180v1/x1.png)

Figure 1: Overview of the proposed framework\. The top branch \(blue\) shows the PLM pipeline, including fine\-tuning, scoring, SHAP\-based sentence ranking, and sentence removal with re\-scoring\. The bottom branch \(yellow\) depicts the corresponding LLM pipeline with prompted scoring and ranking\. The three experimental settings are indicated by dotted red boxes\.

### 3\.1 Dataset

To instantiate our interpretability framework in an educational setting, we use the NCTE dataset \(Demszky and Hill, [2023](https://arxiv.org/html/2606.05180#bib.bib9)\), comprising over 1,600 transcripts of 45–60 minute elementary mathematics lessons\. Lessons are segmented into 15\-minute units, producing 6,005 segments annotated using the CLASS framework \(Pianta et al\., [2008](https://arxiv.org/html/2606.05180#bib.bib8)\), a tool that measures the quality of interactions between teachers and students to assess teaching\. CLASS comprises three domains \(*Emotional Support*, *Classroom Management*, and *Instructional Support*\) covering thirteen dimensions\. We focus on the *Quality of Feedback* \(QoF\) dimension within the Instructional Support domain\.

QoF measures the extent to which teachers provide meaningful feedback, scaffold student thinking, prompt metacognition, elaborate on student responses, and clarify misunderstandings\. Each segment receives a QoF rating on a 1–7 scale, where 1–2 indicates low\-quality or absent feedback, 3–5 reflects moderate or inconsistent feedback, and 6–7 represents consistently high\-quality feedback\.

QoF is chosen for three reasons: \(1\) compared to other CLASS dimensions, its label distribution is less skewed \(mean of 4\.21, standard deviation of 1\.13, with 81% of ratings in the 3–5 range\), making it more suitable for supervised learning; \(2\) QoF is a core instructional practice strongly linked to student learning gains \(Hattie and Timperley, [2007](https://arxiv.org/html/2606.05180#bib.bib10)\); and \(3\) many indicators of feedback quality \(e\.g\., probing questions, elaborations, scaffolding\) manifest clearly in text transcripts \(Demszky and Hill, [2023](https://arxiv.org/html/2606.05180#bib.bib9)\), making QoF particularly appropriate for sentence\-level interpretability analysis\.

We adopt an 80/20 data split, resulting in 4,775 segments for training and 1,230 for testing\. Transcripts belonging to the same class were kept in the same split, and we stratified the data based on label distribution\. The training split is used exclusively for fine\-tuning PLMs, whereas the test split is used to evaluate both PLMs and LLMs, as well as to conduct all interpretability experiments\. Within the test split, 29\.3% of the sentences are student utterances, 69\.9% are teacher utterances, and 0\.8% are utterances that could not be assigned to a speaker\. The dataset provides a single expert annotation per teaching quality dimension, which precludes assessing interrater agreement; we treat this annotation as ground truth for both scoring and evaluation\. The dataset does not include human\-annotated sentence\-level rationales or evidence\.

### 3\.2 Models

Within our framework, we instantiate the scoring model $f$ using two classes of architectures: PLMs and instruction\-tuned LLMs\. We compare these two families because PLMs represent the standard supervised approach for rubric\-based scoring, requiring task\-specific fine\-tuning, whereas LLMs offer strong zero\- or few\-shot capabilities without additional training\. This contrast allows us to examine not only differences in scoring performance but also how explanation methods behave across models with fundamentally different training paradigms, capacities, and levels of transparency\.

#### Pretrained Language Models\.

We fine\-tuned BERT \(Devlin et al\., [2019](https://arxiv.org/html/2606.05180#bib.bib4)\), ALBERT \(Lan et al\., [2020](https://arxiv.org/html/2606.05180#bib.bib13)\), RoBERTa \(Liu et al\., [2019](https://arxiv.org/html/2606.05180#bib.bib14)\), and DeBERTa V3 \(He et al\., [2023](https://arxiv.org/html/2606.05180#bib.bib15)\), using both base and large variants\. PLMs operate on transcript segments at the sentence level: each sentence was encoded using the model’s \[CLS\] representation, and a trainable attention layer computed an attention\-weighted document embedding\. A linear regression head predicted a single scalar QoF score\. We used a maximum sentence length of 128 tokens and a maximum of 263 sentences per document, which encompasses 98% of the sentence lengths and 90% of the document lengths without truncation\. During fine\-tuning, models were optimized using mean squared error; additional hyperparameters are listed in App\. [A](https://arxiv.org/html/2606.05180#A1)\. After fine\-tuning, we applied SHAP to the document\-level regression output, treating each sentence embedding as a separate feature\. SHAP returned one Shapley value per sentence, representing its estimated contribution to the predicted score\.
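To make the sentence\-as\-feature view concrete, the following minimal sketch computes exact Shapley values by enumerating every ordering of four sentences\. It is an illustration under stated assumptions, not the paper's pipeline: `toy_score`, `SENTENCE_EFFECT`, and the choice to model an absent sentence by simply dropping it are hypothetical, and practical SHAP estimators typically approximate these values by sampling rather than enumerating all orderings\.

```python
from itertools import permutations
from typing import Callable, FrozenSet, List

# Hypothetical per-sentence effects on a toy rubric score (not from the paper).
SENTENCE_EFFECT = [0.8, -0.3, 0.0, 0.5]


def toy_score(kept: FrozenSet[int]) -> float:
    """Toy stand-in for a scoring model f: base score plus effects of kept sentences."""
    score = 3.0 + sum(SENTENCE_EFFECT[j] for j in kept)
    if 0 in kept and 3 in kept:
        score += 0.4  # interaction: sentences 0 and 3 reinforce each other
    return score


def shapley_values(n: int, value: Callable[[FrozenSet[int]], float]) -> List[float]:
    """Exact Shapley values: average marginal contribution over all n! orderings."""
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        present = set()
        before = value(frozenset(present))
        for j in order:
            present.add(j)
            after = value(frozenset(present))
            totals[j] += after - before  # marginal contribution of sentence j
            before = after
    return [t / len(orderings) for t in totals]


phi = shapley_values(len(SENTENCE_EFFECT), toy_score)
full = toy_score(frozenset(range(len(SENTENCE_EFFECT))))
empty = toy_score(frozenset())
```

In this toy setting the interaction bonus is split evenly between sentences 0 and 3, and the attributions in `phi` sum to `full - empty`, the efficiency property of Shapley values\. Exhaustive enumeration is only feasible for a handful of sentences, which is why sampling\-based estimators are needed for segments with hundreds of sentences\.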

#### Large Language Models\.

For LLMs, we used instruction\-tuned variants of Llama 3\.1 \(8B, 70B\) \(Grattafiori et al\., [2024](https://arxiv.org/html/2606.05180#bib.bib5)\), Mixtral \(8×7B, 8×22B\) \(Jiang et al\., [2024](https://arxiv.org/html/2606.05180#bib.bib6)\), Qwen 3 \(4B, 30B, 235B\) \(Yang et al\., [2025](https://arxiv.org/html/2606.05180#bib.bib12)\), and Mistral \(Small, Small 24B\) \(Jiang et al\., [2023](https://arxiv.org/html/2606.05180#bib.bib11)\)\. Only open\-source LLMs were selected, as they can be deployed locally and therefore mitigate the privacy concerns associated with datasets such as classroom transcripts\. LLMs performed two tasks: \(1\) QoF scoring using a few\-shot prompt introducing the task and showing examples, and \(2\) sentence ranking using a zero\-shot prompt\. For ranking, LLMs were used as sentence\-level evidence selectors rather than to produce free\-form explanations\. We provided transcripts segmented and numbered by sentence and requested a list of ten sentence indices corresponding to the most influential sentences\. This ensured alignment with the PLM sentence boundaries and prevented models from altering sentence content\. Some outputs contained invalid indices or fewer than ten items; in such cases, the system retried up to ten times\. Prompts are shown in App\. [B](https://arxiv.org/html/2606.05180#A2)\. All models were evaluated with deterministic decoding, and all local inference used 4\-bit `nf4` quantization via BitsAndBytes\.
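The validate\-and\-retry step described above could look roughly like the sketch below\. Everything here is an assumption made for illustration: the function names, 1\-based sentence numbering, a reply that contains only the index list, and the fallback of keeping the usable indices from the last reply once retries are exhausted\. The `ask` callable stands in for whatever sends the ranking prompt to a model\.

```python
import re
from typing import Callable, List, Optional


def extract_indices(raw: str, n_sentences: int) -> List[int]:
    """Pull integers out of a reply, drop repeats and out-of-range values, keep order."""
    seen = dict.fromkeys(int(tok) for tok in re.findall(r"\d+", raw))
    return [i for i in seen if 1 <= i <= n_sentences]


def parse_ranking(raw: str, n_sentences: int, k: int = 10) -> Optional[List[int]]:
    """Return the ranking if the reply is well formed, otherwise None."""
    all_numbers = re.findall(r"\d+", raw)
    valid = extract_indices(raw, n_sentences)
    expected = min(k, n_sentences)
    if len(all_numbers) != expected or len(valid) != expected:
        return None  # wrong count, repeated index, or invalid index
    return valid


def rank_with_retries(ask: Callable[[], str], n_sentences: int,
                      k: int = 10, max_retries: int = 10) -> List[int]:
    """Query once, then retry up to max_retries times while the reply is malformed."""
    fallback: List[int] = []
    for _ in range(1 + max_retries):
        raw = ask()
        parsed = parse_ranking(raw, n_sentences, k)
        if parsed is not None:
            return parsed
        # Assumption: if every attempt fails, keep the usable part of the last reply.
        fallback = extract_indices(raw, n_sentences)
    return fallback
```

For example, `rank_with_retries(lambda: "[4, 1, 2]", n_sentences=5, k=3)` accepts the first reply, whereas a model that keeps answering with two indices is queried eleven times \(one call plus ten retries\) before the shorter list is returned\.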

### 3\.3 Interpretability Methods

We view rubric\-based teaching quality assessment as a complex text scoring problem\. More specifically, given an input transcript $x$ and a model $f$, the model outputs a scalar score $f(x)$ on a fixed rubric\. In this work, we use the term ‘explanations’ to refer to sentence\-level evidence, i\.e\., subsets of transcript sentences identified as most influential for a model’s prediction, following common usage in extractive explanation methods such as SHAP\. An explanation method $E$ maps $x$ and $f$ to a ranked list of textual units \(here, sentences\) that are claimed to be most influential for the score\. To evaluate the faithfulness of such explanations, we adopt a deletion\-based protocol: we progressively remove the top\-$k$ units selected by $E$ \(with $k=10$ in all experiments\), recompute $f$ on the reduced input, and measure the change in predictions\. Larger changes indicate that the explanation has successfully identified text that the model relies on\. To study cross\-model consistency, we extend this protocol to pairs of models $f$ and $g$: we apply explanations obtained from one model to perturb the other and compare the resulting prediction shifts\. In this paper, we instantiate $f$ as fine\-tuned PLMs and $g$ as instruction\-tuned LLMs, with explanations $E$ provided either by sentence\-level SHAP attributions or by LLM\-generated sentence rankings\.
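The deletion protocol can be sketched in a few lines\. This is a minimal illustration, assuming 0\-based sentence indices and taking the unperturbed transcript as the starting point; `toy_scorer` and the example segment are hypothetical stand\-ins for a real scoring model and a real transcript\.

```python
from typing import Callable, List, Sequence


def deletion_deltas(sentences: Sequence[str], ranking: Sequence[int],
                    score: Callable[[List[str]], float], k: int = 10) -> List[float]:
    """Remove ranked sentences one at a time and record the score drop at each step."""
    removed = set()
    previous = score(list(sentences))  # score of the unperturbed input
    deltas = []
    for idx in ranking[:k]:
        removed.add(idx)
        kept = [s for j, s in enumerate(sentences) if j not in removed]
        current = score(kept)
        deltas.append(previous - current)  # positive: the removal lowered the score
        previous = current
    return deltas


def toy_scorer(sentences: List[str]) -> float:
    """Hypothetical scorer: rewards questions, clipped to the 1-7 rubric range."""
    return min(7.0, 1.0 + 1.5 * sum("?" in s for s in sentences))


segment = ["Open your books.", "Why does that work?", "Good.", "How do you know?"]
deltas = deletion_deltas(segment, ranking=[1, 3, 0], score=toy_scorer, k=3)
```

Because the per\-step differences telescope, their sum equals the gap between the original score and the score after all ranked sentences are removed\. A cross\-model test would reuse the same function, passing a ranking produced for one model together with the `score` function of another\.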

### 3\.4 Experiments

To evaluate our proposed sentence\-level interpretability framework, we conduct three sets of experiments\. The framework combines sentence\-level explanation generation \(via SHAP or LLM\-based ranking\) with faithfulness testing through sentence deletion and robustness analysis via cross\-model transfer\. Each experiment targets one aspect of this framework across model families\.

\(1\) Model performance for scoring \(RQ1\): We evaluate PLMs and LLMs on the transcript segment scoring task\.

\(2\) Faithfulness analysis \(RQ2\): We compute score differences \($\Delta$\) after each single\-sentence deletion to assess whether a model’s own explanations meaningfully affect its predictions, as described by the equation:

$$\Delta_i = f(x_{-r_{i-1}}) - f(x_{-r_i}) \tag{1}$$

where $r_i$ denotes the $i$\-th ranked sentence and $x_{-r_i}$ the input after its removal\. All models employ the same sentence\-splitting procedure, and deletions are made in accordance with the ranking order\. When fewer than ten sentences exist or when deletions result in an empty transcript, models are prompted with an empty input\. This only happened for 17 \(1\.3%\) out of 1,230 transcript segments\. In addition to deletion\-based evaluation, we also quantify the alignment between different explanation methods\. For each transcript, we compute the Jaccard similarity between the sets of top\-$k$ sentence indices \(here $k=10$\) selected by SHAP and by each LLM, as well as the Spearman rank correlation between the two rankings\. The Spearman correlation is computed only when both methods produce valid rankings over the same set of sentences; cases with missing or invalid indices, where a consistent ranking is not defined, are excluded\. These metrics capture how often explanation methods highlight the same textual evidence, independently of their causal impact on predictions\.

\(3\) Cross\-model evaluation \(RQ3\): We examine whether sentence\-level explanations generalize across models by applying sentence deletion based on SHAP and LLM rankings to the opposite model family\. We select six representative models for this analysis: BERT large, DeBERTa V3 large, and ALBERT base \(PLMs\) and Qwen 3 235B, Mistral Small, and Llama 3\.1 8B \(LLMs\)\. These models are chosen based on their sensitivity to sentence removal, measured by the cumulative prediction change \($\Delta$\) over the top\-$k$ deletions\. Specifically, we choose models with the highest, lowest, and intermediate $\Delta$ values to capture a range of faithfulness behaviors\.
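The two agreement metrics can be sketched as follows\. This is a minimal version under stated assumptions: the function names are hypothetical, two empty selections are treated as identical, and the Spearman formula used is the tie\-free form, which applies because each ranking lists every sentence once\.

```python
from typing import Optional, Sequence


def jaccard(a: Sequence[int], b: Sequence[int]) -> float:
    """Overlap of two top-k index sets: size of intersection over size of union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty selections count as identical
    return len(set_a & set_b) / len(union)


def spearman(rank_a: Sequence[int], rank_b: Sequence[int]) -> Optional[float]:
    """Spearman correlation of two tie-free orderings of the same items, else None."""
    n = len(rank_a)
    same_items = set(rank_a) == set(rank_b) and len(set(rank_a)) == n == len(rank_b)
    if not same_items or n < 2:
        return None  # rankings over different sentence sets are not comparable
    position_b = {item: pos for pos, item in enumerate(rank_b)}
    d_squared = sum((pos - position_b[item]) ** 2 for pos, item in enumerate(rank_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical top-4 selections from two explanation methods.
shap_top = [4, 7, 1, 9]
llm_top = [7, 4, 2, 9]
overlap = jaccard(shap_top, llm_top)
```

Here the two methods share three of five distinct sentences, so `overlap` is 0\.6, while `spearman(shap_top, llm_top)` returns `None` because the selections do not cover the same sentences\. This mirrors the exclusion rule for cases where a consistent ranking is not defined\.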

We report mean absolute error \(MAE\) and mean squared error \(MSE\)\. For constant baselines, we use the median prediction for MAE and the mean prediction for MSE, corresponding to the respective risk minimizers\. All experiments use identical sentence segmentation, deletion rules, and deterministic LLM decoding\.
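The choice of constants follows from a standard result: the median minimizes mean absolute error and the mean minimizes mean squared error\. The sketch below checks this numerically on a small set of hypothetical ratings \(not the NCTE labels\) by brute\-force search over candidate constants\.

```python
from statistics import mean, median
from typing import Sequence


def mae(labels: Sequence[float], constant: float) -> float:
    """Mean absolute error of predicting the same constant for every label."""
    return sum(abs(y - constant) for y in labels) / len(labels)


def mse(labels: Sequence[float], constant: float) -> float:
    """Mean squared error of predicting the same constant for every label."""
    return sum((y - constant) ** 2 for y in labels) / len(labels)


labels = [3, 4, 4, 4, 5, 5, 7]  # hypothetical QoF ratings, not from the NCTE data
best_for_mae = median(labels)   # risk minimizer for absolute error
best_for_mse = mean(labels)     # risk minimizer for squared error

# Brute-force check over a grid of candidate constants on the 1-7 scale.
grid = [1 + step / 100 for step in range(601)]
grid_best_mae = min(grid, key=lambda c: mae(labels, c))
grid_best_mse = min(grid, key=lambda c: mse(labels, c))
```

On this toy data the grid search lands on the median for MAE and on the grid point closest to the mean for MSE, which is why a single constant cannot serve as the strongest baseline for both metrics\.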

## 4 Results and Discussion

### 4\.1 Model Performance for Scoring

Table [1](https://arxiv.org/html/2606.05180#S4.T1) reports Mean Absolute Error \(MAE\) and Mean Squared Error \(MSE\) for all models\. For PLMs, results are shown before and after fine\-tuning, whereas for LLMs, a single score is reported based on the few\-shot scoring prompt \(see App\. [B](https://arxiv.org/html/2606.05180#A2)\)\.

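| Model | MAE \(non\-fine\-tuned\) | MSE \(non\-fine\-tuned\) | MAE \(fine\-tuned\) | MSE \(fine\-tuned\) |
| --- | --- | --- | --- | --- |
| Constant Baseline | 0\.96 | 1\.35 | – | – |
| ALBERT base | 4\.34 | 20\.12 | 0\.98 | 1\.34 |
| ALBERT large | 4\.22 | 19\.10 | 0\.99 | 1\.45 |
| BERT base | 4\.58 | 22\.29 | 0\.98 | 1\.36 |
| BERT large | 4\.45 | 21\.07 | 0\.97 | 1\.34 |
| DeBERTaV3 base | 4\.08 | 17\.97 | 1\.00 | 1\.56 |
| DeBERTaV3 large | 3\.66 | 14\.65 | 0\.96 | 1\.31 |
| RoBERTa base | 4\.33 | 20\.08 | 0\.97 | 1\.34 |
| RoBERTa large | 3\.27 | 12\.01 | 0\.97 | 1\.37 |
| Llama 3\.1 8B Instruct | 1\.63 | 3\.98 | – | – |
| Llama 3\.1 70B Instruct | 1\.98 | 5\.32 | – | – |
| Mistral Small Instruct | 1\.02 | 1\.78 | – | – |
| Mistral Small 24B Instruct | 1\.75 | 4\.28 | – | – |
| Mixtral 8x7B Instruct | 1\.39 | 2\.85 | – | – |
| Mixtral 8x22B Instruct | 1\.21 | 2\.41 | – | – |
| Qwen3 4B Instruct | 1\.18 | 2\.29 | – | – |
| Qwen3 30B A3B Instruct | 1\.56 | 3\.59 | – | – |
| Qwen3 235B A22B Instruct | 1\.67 | 4\.16 | – | – |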

Table 1: Mean Absolute Error \(MAE\) and Mean Squared Error \(MSE\) for PLMs and LLMs\. PLM results are reported separately for non\-fine\-tuned and fine\-tuned models, while LLM results correspond to prompted inference without task\-specific training\.

For PLMs, fine\-tuning yields a substantial and expected performance improvement\. While non\-fine\-tuned models show MAEs between 3\.27 and 4\.58, all fine\-tuned PLMs achieve uniformly low errors \(MAE 0\.96–1\.00; MSE 1\.31–1\.56\), comparable to the constant baseline\. Performance differences among fine\-tuned PLMs are minimal \($\Delta=0.04$ MAE, $\Delta=0.25$ MSE\), indicating comparable behavior across architectures\. The best\-performing model, DeBERTaV3 large, achieves an MAE of 0\.96 and an MSE of 1\.31\. Given the 1–7 scoring scale, an MAE of ~1 corresponds to an average error of approximately one rubric point, indicating that predictions are typically within one level of the expert annotation\. The corresponding MSE values \(~1\.3–1\.6\) reflect relatively small squared deviations, consistent with the low MAE\.

Despite their strong numerical performance, fine\-tuned PLMs exhibit limited label coverage\. For DeBERTaV3 large, the mean predicted score is 4\.14 \($\sigma=0.16$\), closely matching the dataset average of 4\.22, but predictions never fall below 2\.03 or exceed 5\.89, meaning that extreme labels \(1 and 7\) are never predicted \(see App\. [C](https://arxiv.org/html/2606.05180#A3)\)\. This behavior is consistent across all fine\-tuned PLMs, with some models \(e\.g\., ALBERT and RoBERTa variants\) effectively collapsing to a narrow mid\-range of labels \(3–5\)\. This pattern is likely driven by strong label imbalance at the extremes of the QoF scale, indicating high sensitivity to the underlying data distribution\.
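Label compression of this kind can be checked with a few summary statistics\. The sketch below is one possible diagnostic, not the analysis used in the paper: `compression_report` is a hypothetical helper, the predictions are made up, and rounding each prediction to the nearest rubric level is an assumption about how coverage is counted\.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def compression_report(predictions: Sequence[float],
                       low: int = 1, high: int = 7) -> Dict[str, object]:
    """Summarize how much of the rubric range a model's predictions actually use."""
    # int(p + 0.5) rounds half up for positive scores; clip to the rubric bounds.
    levels = {min(high, max(low, int(p + 0.5))) for p in predictions}
    missing: List[int] = [lvl for lvl in range(low, high + 1) if lvl not in levels]
    return {
        "mean": mean(predictions),
        "std": pstdev(predictions),
        "min": min(predictions),
        "max": max(predictions),
        "missing_levels": missing,
    }


# Hypothetical predictions from a compressed model (not the paper's outputs).
report = compression_report([4.1, 4.3, 3.9, 4.0, 4.6, 3.7, 4.2])
```

A small standard deviation combined with a long `missing_levels` list signals that a model with a low average error may still never produce the extreme ratings\.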

LLMs show weaker overall scoring performance than fine-tuned PLMs. The best-performing LLM, Mistral Small Instruct, achieves an MAE of 1.02 and an MSE of 1.78, which remains higher than both the best PLM and the constant baseline. In contrast to PLMs, performance variability across LLMs is substantially larger ($\Delta = 0.96$ MAE, $\Delta = 3.54$ MSE), reflecting notable differences across models. However, most LLMs produce predictions spanning the full 1–7 score range. For Mistral Small Instruct, the mean predicted score is 4.37 ($\sigma = 0.77$), indicating a much broader dispersion than observed for PLMs.

Overall, these results highlight a trade\-off between accuracy and flexibility\. Fine\-tuned PLMs achieve substantially higher scoring accuracy but are very sensitive to training data distribution, whereas LLMs provide wider score distributions at the cost of reduced accuracy\. In educational settings, LLMs offer practical advantages due to their out\-of\-the\-box usability and lack of task\-specific training requirements, but incur higher computational costs and lower predictive reliability\. In response to RQ1, fine\-tuned PLMs outperform prompted LLMs in accuracy, while LLMs better preserve score variability\.

### 4\.2Faithfulness Analysis

The second experiment evaluates the faithfulness of SHAP- and LLM-based sentence importance rankings. For SHAP, the number of selected sentences is deterministically fixed as the minimum of ten and the number of sentences in each transcript segment. In contrast, LLMs show limited controllability: even when prompted to return exactly ten sentences, they often return more or fewer. Models such as Mixtral 8x7B and Qwen3 235B deviate most strongly from the expected average of 9.931 sentences, typically returning fewer sentences despite up to ten retry attempts for malformed outputs. Llama 3.1 8B is the only model that, on average, returns more sentences than requested and comes closest to the target value (see App. [D](https://arxiv.org/html/2606.05180#A4)).
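The selection rule and the retry loop can be sketched as follows. This is a hedged illustration, not the authors' pipeline: `generate` is a hypothetical callable wrapping an LLM call, and the definition of a malformed output (not a list, out-of-range indices, or duplicates) is an assumption. Under that assumption, a well-formed answer with the wrong number of sentences is accepted and its deviation from the target is recorded.

```python
def expected_count(n_sentences, k=10):
    """Number of sentences selected for SHAP: min(k, sentences in the segment)."""
    return min(k, n_sentences)

def is_well_formed(indices, n_sentences):
    """Assumed validity check: a non-empty list of unique, in-range integers."""
    return (
        isinstance(indices, list)
        and len(indices) > 0
        and all(isinstance(i, int) and 0 <= i < n_sentences for i in indices)
        and len(set(indices)) == len(indices)
    )

def request_ranking(generate, n_sentences, k=10, max_retries=10):
    """Call a hypothetical LLM wrapper until its output is well formed.

    Returns (indices, deviation), where deviation is the number of returned
    sentences minus the target count. Returns (None, None) if every attempt
    is malformed.
    """
    target = expected_count(n_sentences, k)
    for _ in range(1 + max_retries):
        indices = generate()
        if is_well_formed(indices, n_sentences):
            return indices, len(indices) - target
    return None, None

# Simulated LLM replies: two malformed outputs, then a short but valid ranking.
replies = iter(["not a list", [0, 0, 3], [5, 2, 9, 1, 7, 4, 11, 3]])
indices, deviation = request_ranking(lambda: next(replies), n_sentences=12)
print(indices, deviation)
```

Here the third reply is accepted even though it contains eight rather than ten sentences, mirroring the observation that retries repair formatting but not necessarily the count.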

| Group | Model | $\overline{\Delta}$ |
|---|---|---|
| PLMs | ALBERT base | 0.0219 |
| PLMs | ALBERT large | 0.0172 |
| PLMs | BERT base | 0.0256 |
| PLMs | BERT large | 0.0329 |
| PLMs | DeBERTaV3 base | 0.0242 |
| PLMs | DeBERTaV3 large | 0.0049 |
| PLMs | RoBERTa base | 0.0053 |
| PLMs | RoBERTa large | 0.0119 |
| LLMs | Llama 3.1 8B Instruct | 0.0174 |
| LLMs | Llama 3.1 70B Instruct | 0.0090 |
| LLMs | Mistral Small Instruct | 0.0033 |
| LLMs | Mistral Small 24B Instruct | 0.0123 |
| LLMs | Mixtral 8x7B Instruct | 0.0121 |
| LLMs | Mixtral 8x22B Instruct | 0.0199 |
| LLMs | Qwen3 4B Instruct | 0.0211 |
| LLMs | Qwen3 30B A3B Instruct | 0.0174 |
| LLMs | Qwen3 235B A22B Instruct | 0.0388 |

Table 2: Average consecutive performance change $\overline{\Delta}$ across sentence removals for PLMs and LLMs.

These results highlight the inherent variability and limited controllability of LLM outputs, even under structured prompting and multiple retries. This unpredictability represents a practical limitation when deploying LLMs in educational systems, where reliability and strict adherence to output formats are critical. Systems that incorporate LLMs must therefore be explicitly designed to handle such inconsistencies if these models are to be used effectively in real-world educational settings.

We then remove one sentence at a time from the top ten ranked sentences and re-score each transcript segment using the same model that produced the ranking. Table [2](https://arxiv.org/html/2606.05180#S4.T2) reports the average prediction change after each consecutive removal, denoted as $\overline{\Delta}$, and calculated as follows:

$$\overline{\Delta} = \frac{1}{k}\sum_{i=1}^{k}\Delta_{i} \tag{2}$$

where $k = 10$ is the number of sentences deleted, and $\Delta_{i}$ is calculated as described in Eq. [1](https://arxiv.org/html/2606.05180#S3.E1). Among PLMs, BERT models are most sensitive to sentence removal, with BERT large exhibiting the highest average change ($\overline{\Delta} = 0.0329$). For LLMs, the largest change is observed for Qwen3 235B ($\overline{\Delta} = 0.0388$), while all other LLMs show $\overline{\Delta}$ values of at most 0.0211 (Qwen3 4B). Overall, this indicates that both model families can identify influential sentences, though PLMs do so more consistently, while LLM behavior is more variable (see Section [4.3](https://arxiv.org/html/2606.05180#S4.SS3)).
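The deletion protocol and Eq. 2 can be sketched in a few lines. Eq. 1 is not reproduced in this section, so the definition of $\Delta_i$ used here is an assumption: the score before the $i$-th removal minus the score after it, which matches the sign convention that a negative $\Delta$ means the removal raised the predicted score. The scorer `toy_score` is a hypothetical stand-in, not one of the evaluated models.

```python
def toy_score(sentences):
    """Hypothetical scorer: 3.0, plus 0.5 per 'why' sentence, minus 0.5 per 'wrong'."""
    text = [s.lower() for s in sentences]
    return 3.0 + 0.5 * sum("why" in s for s in text) - 0.5 * sum("wrong" in s for s in text)

def deletion_deltas(score, sentences, ranked_indices):
    """Remove ranked sentences one at a time and record consecutive changes.

    Assumption: delta_i = score before the i-th removal - score after it.
    """
    removed = set()
    previous = score(sentences)
    deltas = []
    for idx in ranked_indices:
        removed.add(idx)
        remaining = [s for j, s in enumerate(sentences) if j not in removed]
        current = score(remaining)
        deltas.append(previous - current)
        previous = current
    return deltas

def mean_delta(deltas):
    """Average consecutive change, i.e. Eq. 2 with k = len(deltas)."""
    return sum(deltas) / len(deltas)

segment = [
    "Why do you think that works?",
    "Okay.",
    "That is wrong.",
    "Can you explain why?",
]
ranking = [0, 3, 2]  # illustrative importance order, most important first

deltas = deletion_deltas(toy_score, segment, ranking)
print(deltas, mean_delta(deltas))
```

Under this assumed definition the consecutive changes telescope, so $\overline{\Delta}$ equals the total shift between the original and the fully perturbed score divided by $k$. Large positive and negative steps can therefore cancel, which is one reason to inspect the full trajectory as well as the average.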

Beyond faithfulness, we observe largely similar patterns in the types of sentences selected by PLMs and LLMs\. Both predominantly prioritize teacher\-authored utterances, consistent with the QoF rating dimension: LLMs select teacher utterances in 79\.5% of cases and student utterances in 19\.5%, while PLMs show a comparable distribution \(74\.0% teacher, 24\.7% student\)\. Despite this similarity, alignment between SHAP\- and LLM\-based explanations remains consistently low\. Across nine LLMs, the mean Jaccard similarity between the indices of the top\-10 sentence sets is 0\.085, corresponding to an overlap of roughly one to two sentences per transcript, while the mean Spearman rank correlation is 0\.062, indicating weak agreement in sentence importance ordering\. These trends are consistent across model families and sizes, suggesting that explanation alignment is driven primarily by the explanation method rather than model scale\.
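The link between a Jaccard similarity of 0.085 and "one to two sentences" of overlap can be verified directly. The sketch below assumes both sets contain exactly $k = 10$ sentences, in which case $J = m / (2k - m)$ for $m$ shared sentences and therefore $m = 2kJ / (1 + J)$. Applying this to a mean Jaccard value is only an approximation, since the relation is nonlinear and some segments have fewer than ten sentences.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of sentence indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def overlap_from_jaccard(j, k=10):
    """Shared items implied by Jaccard j for two sets of size k: m = 2kj / (1 + j)."""
    return 2 * k * j / (1 + j)

# Two illustrative top-10 sets that share exactly two sentence indices.
shap_top = list(range(10))          # indices 0..9
llm_top = [8, 9] + list(range(10, 18))

print(jaccard(shap_top, llm_top))   # 2 shared out of 18 distinct indices
print(overlap_from_jaccard(0.085))  # about 1.57 shared sentences
```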

![Refer to caption](https://arxiv.org/html/2606.05180v1/x2.png)

Figure 2: Prediction change $\Delta$ under progressive sentence removal for selected PLMs and LLMs. Sentences are chosen using SHAP (solid lines) or LLM-based rankings (dashed lines), and inputs are re-scored by the same model; final changes are shown on the right.

Fig. [2](https://arxiv.org/html/2606.05180#S4.F2) illustrates prediction changes under progressive sentence removal for selected PLMs and LLMs. For completeness, App. [F](https://arxiv.org/html/2606.05180#A6) reports results for all models, App. [G](https://arxiv.org/html/2606.05180#A7) provides representative sentence examples, and App. [E](https://arxiv.org/html/2606.05180#A5) contains the full alignment statistics. Negative $\Delta$ values indicate that removing a sentence increases the predicted score, implying that the sentence contributed negatively to the model's assessment. This behavior is consistent with the rubric, where low-quality feedback instances are expected to reduce the overall score.

Together, these results answer RQ2: SHAP\-based sentence rankings are more faithful for PLM scorers, whereas LLM\-generated rationales induce smaller and often unstable prediction changes, even for the models that produce them\.

### 4\.3Cross\-Model Evaluation

For cross-model evaluation, we select three PLMs and three LLMs representing high, medium, and low faithfulness in the single-model deletion analysis. Specifically, we use BERT large (most affected), ALBERT base (moderately affected), and DeBERTaV3 large (least affected) among PLMs, and Qwen3 235B, Llama 3.1 8B, and Mistral Small as corresponding LLMs. Their sentence-removal trajectories under self-generated explanations are shown in Fig. [2](https://arxiv.org/html/2606.05180#S4.F2), with full results in App. [F](https://arxiv.org/html/2606.05180#A6).

To assess transfer from LLMs to PLMs, we apply LLM-generated sentence rankings to PLMs and re-score the perturbed inputs. Fig. [3](https://arxiv.org/html/2606.05180#S4.F3)(i) compares these results with the baseline condition where PLMs are perturbed using SHAP-selected sentences. Across all models, removing LLM-ranked sentences produces substantially smaller prediction shifts than removing SHAP-ranked sentences, even for PLMs that are weakly sensitive to SHAP. Moreover, deletion trajectories induced by LLM rationales are often non-monotonic, with predictions fluctuating as additional sentences are removed, indicating limited alignment with the features PLMs rely on. Among LLMs, Qwen3 235B shows the strongest cross-model transfer, yielding the largest and most stable perturbations, though still markedly weaker than those induced by SHAP.
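Cross-model transfer amounts to applying a ranking produced by one explainer to a different scorer and inspecting the resulting deletion trajectory. The sketch below is a simplified illustration under two assumptions: the trajectory is measured as the cumulative change relative to the unperturbed score, and `toy_score` is a hypothetical scorer standing in for the target model. The two rankings are invented and do not correspond to SHAP or to any LLM.

```python
def toy_score(sentences):
    """Hypothetical target scorer (not a real PLM or LLM)."""
    text = [s.lower() for s in sentences]
    return 3.0 + 0.5 * sum("why" in s for s in text) - 0.5 * sum("wrong" in s for s in text)

def cumulative_trajectory(score, sentences, ranked_indices):
    """Change relative to the unperturbed score after each additional removal."""
    base = score(sentences)
    removed, trajectory = set(), []
    for idx in ranked_indices:
        removed.add(idx)
        kept = [s for j, s in enumerate(sentences) if j not in removed]
        trajectory.append(base - score(kept))
    return trajectory

def is_monotonic(trajectory, eps=1e-9):
    """True if the trajectory never reverses direction."""
    steps = [b - a for a, b in zip(trajectory, trajectory[1:])]
    return all(s >= -eps for s in steps) or all(s <= eps for s in steps)

segment = [
    "Why do you think that works?",
    "Okay.",
    "That is wrong.",
    "Can you explain why?",
]
ranking_a = [0, 3, 1]  # ranking from one hypothetical explainer
ranking_b = [0, 3, 2]  # ranking transferred from another hypothetical explainer

traj_a = cumulative_trajectory(toy_score, segment, ranking_a)
traj_b = cumulative_trajectory(toy_score, segment, ranking_b)
print(traj_a, is_monotonic(traj_a))
print(traj_b, is_monotonic(traj_b))
```

In this toy case the second ranking mixes sentences that push the score in opposite directions, so its trajectory rises and then falls, which is the kind of fluctuation described above for transferred LLM rankings.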

![Refer to caption](https://arxiv.org/html/2606.05180v1/x3.png)

Figure 3: Prediction change $\Delta$ under progressive sentence removal for selected PLMs and LLMs. Panel (i) shows PLM re-scoring after removing sentences selected by SHAP (solid) or LLM-based rankings (dashed/dotted); panel (ii) shows LLM re-scoring after removing sentences selected by the LLM itself (solid) or by SHAP from fine-tuned PLMs (dashed/dotted).

We then consider the reverse direction, applying PLM-derived SHAP explanations to LLM scoring. As shown in Fig. [3](https://arxiv.org/html/2606.05180#S4.F3)(ii), removing the single most influential SHAP-ranked sentence frequently causes a large immediate shift in LLM predictions, followed by stabilization in subsequent deletions. This effect is most pronounced for PLM–LLM pairs that are highly sensitive to sentence removal. The consistent "first-step jump" suggests that SHAP identifies sentence-level features that are relevant not only to PLMs but also to LLMs, despite architectural and training differences.

Overall, these results indicate that PLMs and LLMs rely on different sentence\-level evidence\. While LLM rationales often surface intuitively relevant content, they do not reliably capture the features driving PLM predictions, regardless of model sensitivity\. In contrast, SHAP explanations consistently induce larger, more stable, and more coherent prediction shifts both within PLMs and when transferred to LLMs, demonstrating superior faithfulness for sentence\-level interpretability\.

This difference is likely influenced by architectural factors: PLMs are trained with sentence\-level representations, whereas LLMs operate primarily at the token level\. Nevertheless, sentence\-ranking attributions remain a useful tool for improving the interpretability of rubric\-based scoring, particularly when LLMs are used as classifiers\. Addressing RQ3, the cross\-model evaluation shows that SHAP explanations generalize across architectures, while LLM rationales transfer poorly and appear unreliable as general\-purpose explanations\.

### 4\.4Ablation Study

To control for potential structural artifacts, we compare against a random sentence-deletion baseline matched for sentence length. Table [3](https://arxiv.org/html/2606.05180#S4.T3) shows a comparison of the average prediction change between removing ranked sentences and removing random sentences. This baseline yields near-zero prediction changes between consecutive removals, indicating that the larger effects observed for SHAP- and LLM-based rankings are not explained by sentence length or generic perturbation, but by the identification of influential content.

| Group | Model | Ranked $\overline{\Delta}$ | Random $\overline{\Delta}$ |
|---|---|---|---|
| PLMs | ALBERT base | 0.0219 | 0.0075 |
| PLMs | BERT large | 0.0329 | 0.0082 |
| PLMs | DeBERTaV3 large | 0.0049 | -0.0064 |
| LLMs | Llama 3.1 8B Instruct | 0.0174 | -0.0076 |
| LLMs | Mistral Small Instruct | 0.0033 | 0.0016 |
| LLMs | Qwen3 235B A22B Instruct | 0.0388 | -0.0036 |

Table 3: Average prediction change ($\overline{\Delta}$) comparing ranked and random sentence removal for PLMs and LLMs, serving as a baseline for explanation faithfulness.
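One way to build a length-matched random baseline is sketched below. The matching procedure is not specified in this section, so every detail here is an assumption: length is counted in whitespace tokens, a match must lie within `tolerance` tokens of the ranked sentence, random picks are drawn only from sentences outside the ranked set, and the closest remaining sentence is used when no candidate is within tolerance.

```python
import random

def length_matched_random(sentences, ranked_indices, tolerance=2, seed=0):
    """For each ranked sentence, draw an unranked sentence of similar length."""
    rng = random.Random(seed)
    ranked = set(ranked_indices)
    n_tokens = [len(s.split()) for s in sentences]
    # Assumption: random picks come from sentences that were not ranked.
    pool = [i for i in range(len(sentences)) if i not in ranked]
    chosen = []
    for idx in ranked_indices:
        if not pool:
            break
        close = [j for j in pool if abs(n_tokens[j] - n_tokens[idx]) <= tolerance]
        if close:
            pick = rng.choice(close)
        else:
            # Fall back to the closest remaining length.
            pick = min(pool, key=lambda j: abs(n_tokens[j] - n_tokens[idx]))
        chosen.append(pick)
        pool.remove(pick)
    return chosen

segment = [
    "Why do you think that strategy works here?",
    "Okay.",
    "Yes.",
    "Can you explain how you got twelve?",
    "We added the two numbers together first.",
    "Good.",
    "I counted up from the bigger number instead.",
]
ranked = [0, 1]  # illustrative ranked sentences: one long, one short
random_picks = length_matched_random(segment, ranked)
print(random_picks)
```

The picks can then be passed through the same deletion procedure as the ranked sentences, so that any difference in $\overline{\Delta}$ reflects which sentences were removed rather than how much text was removed.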

## 5Conclusion

We propose a general framework for evaluating the faithfulness of sentence\-level explanations in rubric\-based scoring, systematically contrasting Shapley value attributions with LLM\-generated rationales based on their causal impact on model predictions\. Applied to the QoF dimension, the framework shows that fine\-tuned PLMs outperform prompted LLMs in scoring accuracy \(RQ1\), although PLMs exhibit label compression, whereas LLMs provide broader but less precise predictions\. Despite the relatively low error values, there remains room for improvement in prediction accuracy, particularly for extreme score ranges that are underrepresented in the data\. Deletion\-based tests demonstrate that SHAP explanations are substantially more faithful than LLM rationales \(RQ2\), and cross\-model evaluations reveal that SHAP\-selected sentences transfer more robustly across architectures, whereas LLM rationales exert limited influence on PLM predictions \(RQ3\)\. Overall, the results suggest that current LLM rationales are unreliable as faithful justifications for rubric\-based scores, whereas Shapley\-based attributions provide a more stable foundation for transparent and actionable automated assessment\. More broadly, our findings suggest that the proposed framework offers a robust and extensible way to evaluate scoring models and their explanations in high\-stakes settings such as education and can be readily applied to other rubric\-based language assessment tasks \(e\.g\., essay scoring, peer feedback quality, or clinical note evaluation\)\.

## Limitations

A primary limitation of this work is the dataset’s size and label distribution\. Although we selected the least skewed CLASS dimension, only 19% of the labels fall outside the 3–5 range across 6k transcript segments\. This imbalance likely contributes to the observed label compression in fine\-tuned PLMs, limiting model performance at the extremes of the scale\. Future work could explore data augmentation or synthetic data generation to improve coverage of underrepresented classes\. In addition, our experiments focus solely on the Quality of Feedback dimension, and it remains unclear how well the findings generalize to other CLASS dimensions, particularly those that are less discourse\-driven, such as Productivity within the Classroom Management domain\.

Our analysis is further restricted to text\-only transcripts\. CLASS scoring in practice relies on rich, multimodal cues, including prosody, timing, and visual interactional signals, which are currently ignored\. Moreover, each transcript segment is annotated by a single expert, preventing assessment of inter\-rater reliability and leaving open the possibility of annotation noise and subjective bias\. Whereas Quality of Feedback is primarily teacher\-centered, a substantial portion of the transcripts consists of student utterances, which were also frequently selected by the sentence\-ranking methods\. Future work should investigate how explicitly filtering or modeling student contributions affects both scoring and the interpretability of results\.

Finally, the deletion-based faithfulness protocol perturbs the natural discourse structure of classroom interaction and may alter pragmatic meaning and speaker intent, limiting its ability to reflect true causal influence. This disruption may disproportionately affect LLM-based scoring and explanations, as LLMs are trained on coherent sentence generation and strongly rely on intact discourse structure and pragmatic flow. Additionally, we do not directly evaluate whether the sentences identified by the ranking methods align with the underlying rubric constructs (i.e., whether they are construct-relevant). Incorporating human judgments on a subset of transcripts and sentence rankings would provide a more direct assessment of this alignment and is an important direction for future work. We also observed notable reliability issues in LLM sentence-ranking outputs, which required strict prompt engineering and external sentence segmentation to enforce structured predictions, thereby constraining the natural expressive capacity of LLM-based explanations. Nevertheless, we observe a clear asymmetry: removing SHAP-selected sentences leads to substantially larger changes in predicted scores than removing LLM-selected or randomly selected sentences. This suggests that the observed differences in faithfulness cannot be attributed solely to discourse disruption, but rather reflect differences in how well each method identifies sentences that are truly influential for the model's predictions.

## Ethical considerations

A central ethical concern in automated rubric-based scoring is the risk of bias amplification from training labels. Models trained on human annotations may inherit subjective judgments or structural biases and propagate them at scale (Barocas and Selbst, [2016](https://arxiv.org/html/2606.05180#bib.bib35); Mehrabi et al., [2021](https://arxiv.org/html/2606.05180#bib.bib36)). In educational contexts, such effects are particularly concerning, as algorithmic assessment systems can reproduce or exacerbate existing inequalities (Nguyen et al., [2023](https://arxiv.org/html/2606.05180#bib.bib38)).

Accordingly, automated scoring systems should not replace human raters in high-stakes settings, where unchecked deployment may create self-reinforcing feedback loops of biased predictions. Instead, these systems should be positioned as self-assessment or decision-support tools that augment, rather than replace, professional judgment, consistent with established principles for ethical and human-centered AI in education (Holmes et al., [2022](https://arxiv.org/html/2606.05180#bib.bib37); Nguyen et al., [2023](https://arxiv.org/html/2606.05180#bib.bib38)).

Relatedly, there is a risk that model predictions and explanations may be interpreted as objective ground truth\. In educational settings, this is particularly problematic, as model\-generated feedback can influence teaching practices, evaluations, and institutional decision\-making\. This underscores the critical need for transparency: educators and stakeholders must be able to understand how and why a model arrives at a given score to appropriately contextualize and challenge its outputs\. To mitigate misuse, it is essential to communicate that both scores and explanations are probabilistic and imperfect, and that interpretability methods are intended to support teacher reflection and informed decision\-making rather than to serve as authoritative or definitive assessments\.

Privacy and data protection are also key ethical considerations\. Although the NCTE transcripts used in this work are anonymized, classroom interactions inherently involve sensitive information about teachers and minors\. Any real\-world deployment of similar systems must ensure strict safeguards for data security, informed consent, and compliance with regulations governing the processing of educational data\.

Furthermore, the use of LLM\-generated rationales introduces additional risks\. LLMs are known to hallucinate, produce inconsistent outputs, and lack transparent decision processes, which makes their explanations particularly problematic in high\-stakes settings such as education\. Limited reproducibility further complicates auditing and accountability\. These factors reinforce the need for caution when using LLM rationales as justifications for automated judgments, and motivate the emphasis of this work on faithfulness\-based evaluation\. In particular, explanation methods should be evaluated not only for plausibility but also for their causal impact on model predictions, and systems should avoid presenting unvalidated LLM rationales as authoritative interpretations of teaching practice\.

#### Use of AI Assistance\.

We used AI assistance tools (ChatGPT and GitHub Copilot) to aid in rewriting code and paraphrasing. All AI-generated content was thoroughly reviewed and verified by the authors. AI was not used to generate new research ideas or original findings; rather, it served as a support tool to improve clarity, efficiency, and organization. In accordance with ACL guidelines, our use of AI aligns with permitted assistance categories, and we have transparently reported all relevant usage in this paper. While AI contributed to enhancing the quality of the work, no direct research outputs are the result of AI assistance.

## References

- Reducing hallucination in structured outputs via retrieval-augmented generation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), Y. Yang, A. Davani, A. Sil, and A. Kumar (Eds.), Mexico City, Mexico, pp. 228–238. External Links: [Link](https://aclanthology.org/2024.naacl-industry.19/), [Document](https://dx.doi.org/10.18653/v1/2024.naacl-industry.19). Cited by: [§2.2](https://arxiv.org/html/2606.05180#S2.SS2.p1.1).
- S. Barocas and A. D. Selbst (2016) Big data’s disparate impact. California Law Review. External Links: [Document](https://dx.doi.org/https%3A//doi.org/10.15779/Z38BG31). Cited by: [Ethical considerations](https://arxiv.org/html/2606.05180#Sx2.p1.1).
- D. Boulanger and V. Kumar (2020) SHAPed automated essay scoring: explaining writing features’ contributions to English writing organization. In Intelligent Tutoring Systems, V. Kumar and C. Troussas (Eds.), Cham, pp. 68–78. External Links: ISBN 978-3-030-49663-0. Cited by: [§2.1](https://arxiv.org/html/2606.05180#S2.SS1.p3.1).
- D. Demszky and H. Hill (2023) The NCTE transcripts: a dataset of elementary math classroom transcripts. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), E. Kochmar, J. Burstein, A. Horbach, R. Laarmann-Quante, N. Madnani, A. Tack, V. Yaneva, Z. Yuan, and T. Zesch (Eds.), Toronto, Canada, pp. 528–538. External Links: [Link](https://aclanthology.org/2023.bea-1.44/), [Document](https://dx.doi.org/10.18653/v1/2023.bea-1.44). Cited by: [§1](https://arxiv.org/html/2606.05180#S1.p5.1), [§2.3](https://arxiv.org/html/2606.05180#S2.SS3.p1.1), [§3.1](https://arxiv.org/html/2606.05180#S3.SS1.p1.1), [§3.1](https://arxiv.org/html/2606.05180#S3.SS1.p3.1).
- D. Demszky, J. Liu, H. C. Hill, D. Jurafsky, and C. Piech (2024) Can automated feedback improve teachers’ uptake of student ideas? Evidence from a randomized controlled trial in a large-scale online course. Educational Evaluation and Policy Analysis 46(3), pp. 483–505. External Links: [Document](https://dx.doi.org/10.3102/01623737231169270), [Link](https://doi.org/10.3102/01623737231169270). Cited by: [§1](https://arxiv.org/html/2606.05180#S1.p2.1), [§2.3](https://arxiv.org/html/2606.05180#S2.SS3.p1.1).
- J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), Minneapolis, Minnesota, pp. 4171–4186. External Links: [Link](https://aclanthology.org/N19-1423/), [Document](https://dx.doi.org/10.18653/v1/N19-1423). Cited by: [§3.2](https://arxiv.org/html/2606.05180#S3.SS2.SSS0.Px1.p1.1).
- J. DeYoung, S. Jain, N. F. Rajani, E. Lehman, C. Xiong, R. Socher, and B. C. Wallace (2020) ERASER: A benchmark to evaluate rationalized NLP models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault (Eds.), Online, pp. 4443–4458. External Links: [Link](https://aclanthology.org/2020.acl-main.408/), [Document](https://dx.doi.org/10.18653/v1/2020.acl-main.408). Cited by: [§2.1](https://arxiv.org/html/2606.05180#S2.SS1.p2.1).
- European Parliament and Council of the European Union (2024) Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Note: Official Journal of the European Union, OJ L, 2024/1689, 12.7.2024. Entered into force on 1 August 2024. Cited by: [§1](https://arxiv.org/html/2606.05180#S1.p1.1).
- T. Fütterer, R. Hou, B. Bühler, E. Bozkir, C. Bell, E. Kasneci, P. Gerjets, and U. Trautwein (2026) Validating automated assessments of teaching effectiveness using multimodal data. Learning and Instruction 101, pp. 102264. External Links: ISSN 0959-4752, [Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.learninstruc.2025.102264), [Link](https://www.sciencedirect.com/science/article/pii/S0959475225001884). Cited by: [§1](https://arxiv.org/html/2606.05180#S1.p3.1), [§2.3](https://arxiv.org/html/2606.05180#S2.SS3.p1.1).
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, A\. Yang, A\. Fan, A\. Goyal, A\. Hartshorn, A\. Yang, A\. Mitra, A\. Sravankumar, A\. Korenev, A\. Hinsvark, A\. Rao, A\. Zhang, A\. Rodriguez, A\. Gregerson, A\. Spataru, B\. Roziere, B\. Biron, B\. Tang, B\. Chern, C\. Caucheteux, C\. Nayak, C\. Bi, C\. Marra, C\. McConnell, C\. Keller, C\. Touret, C\. Wu, C\. Wong, C\. C\. Ferrer, C\. Nikolaidis, D\. Allonsius, D\. Song, D\. Pintz, D\. Livshits, D\. Wyatt, D\. Esiobu, D\. Choudhary, D\. Mahajan, D\. Garcia\-Olano, D\. Perino, D\. Hupkes, E\. Lakomkin, E\. AlBadawy, E\. Lobanova, E\. Dinan, E\. M\. Smith, F\. Radenovic, F\. Guzmán, F\. Zhang, G\. Synnaeve, G\. Lee, G\. L\. Anderson, G\. Thattai, G\. Nail, G\. Mialon, G\. Pang, G\. Cucurell, H\. Nguyen, H\. Korevaar, H\. Xu, H\. Touvron, I\. Zarov, I\. A\. Ibarra, I\. Kloumann, I\. Misra, I\. Evtimov, J\. Zhang, J\. Copet, J\. Lee, J\. Geffert, J\. Vranes, J\. Park, J\. Mahadeokar, J\. Shah, J\. van der Linde, J\. Billock, J\. Hong, J\. Lee, J\. Fu, J\. Chi, J\. Huang, J\. Liu, J\. Wang, J\. Yu, J\. Bitton, J\. Spisak, J\. Park, J\. Rocca, J\. Johnstun, J\. Saxe, J\. Jia, K\. V\. Alwala, K\. Prasad, K\. Upasani, K\. Plawiak, K\. Li, K\. Heafield, K\. Stone, K\. El\-Arini, K\. Iyer, K\. Malik, K\. Chiu, K\. Bhalla, K\. Lakhotia, L\. Rantala\-Yeary, L\. van der Maaten, L\. Chen, L\. Tan, L\. Jenkins, L\. Martin, L\. Madaan, L\. Malo, L\. Blecher, L\. Landzaat, L\. de Oliveira, M\. Muzzi, M\. Pasupuleti, M\. Singh, M\. Paluri, M\. Kardas, M\. Tsimpoukelli, M\. Oldham, M\. Rita, M\. Pavlova, M\. Kambadur, M\. Lewis, M\. Si, M\. K\. Singh, M\. Hassan, N\. Goyal, N\. Torabi, N\. Bashlykov, N\. Bogoychev, N\. Chatterji, N\. Zhang, O\. Duchenne, O\. Çelebi, P\. Alrassy, P\. Zhang, P\. Li, P\. Vasic, P\. Weng, P\. Bhargava, P\. Dubal, P\. Krishnan, P\. S\. Koura, P\. Xu, Q\. He, Q\. Dong, R\. Srinivasan, R\. Ganapathy, R\. Calderer, R\. S\. 
Cabral, R\. Stojnic, R\. Raileanu, R\. Maheswari, R\. Girdhar, R\. Patel, R\. Sauvestre, R\. Polidoro, R\. Sumbaly, R\. Taylor, R\. Silva, R\. Hou, R\. Wang, S\. Hosseini, S\. Chennabasappa, S\. Singh, S\. Bell, S\. S\. Kim, S\. Edunov, S\. Nie, S\. Narang, S\. Raparthy, S\. Shen, S\. Wan, S\. Bhosale, S\. Zhang, S\. Vandenhende, S\. Batra, S\. Whitman, S\. Sootla, S\. Collot, S\. Gururangan, S\. Borodinsky, T\. Herman, T\. Fowler, T\. Sheasha, T\. Georgiou, T\. Scialom, T\. Speckbacher, T\. Mihaylov, T\. Xiao, U\. Karn, V\. Goswami, V\. Gupta, V\. Ramanathan, V\. Kerkez, V\. Gonguet, V\. Do, V\. Vogeti, V\. Albiero, V\. Petrovic, W\. Chu, W\. Xiong, W\. Fu, W\. Meers, X\. Martinet, X\. Wang, X\. Wang, X\. E\. Tan, X\. Xia, X\. Xie, X\. Jia, X\. Wang, Y\. Goldschlag, Y\. Gaur, Y\. Babaei, Y\. Wen, Y\. Song, Y\. Zhang, Y\. Li, Y\. Mao, Z\. D\. Coudert, Z\. Yan, Z\. Chen, Z\. Papakipos, A\. Singh, A\. Srivastava, A\. Jain, A\. Kelsey, A\. Shajnfeld, A\. Gangidi, A\. Victoria, A\. Goldstand, A\. Menon, A\. Sharma, A\. Boesenberg, A\. Baevski, A\. Feinstein, A\. Kallet, A\. Sangani, A\. Teo, A\. Yunus, A\. Lupu, A\. Alvarado, A\. Caples, A\. Gu, A\. Ho, A\. Poulton, A\. Ryan, A\. Ramchandani, A\. Dong, A\. Franco, A\. Goyal, A\. Saraf, A\. Chowdhury, A\. Gabriel, A\. Bharambe, A\. Eisenman, A\. Yazdan, B\. James, B\. Maurer, B\. Leonhardi, B\. Huang, B\. Loyd, B\. D\. Paola, B\. Paranjape, B\. Liu, B\. Wu, B\. Ni, B\. Hancock, B\. Wasti, B\. Spence, B\. Stojkovic, B\. Gamido, B\. Montalvo, C\. Parker, C\. Burton, C\. Mejia, C\. Liu, C\. Wang, C\. Kim, C\. Zhou, C\. Hu, C\. Chu, C\. Cai, C\. Tindal, C\. Feichtenhofer, C\. Gao, D\. Civin, D\. Beaty, D\. Kreymer, D\. Li, D\. Adkins, D\. Xu, D\. Testuggine, D\. David, D\. Parikh, D\. Liskovich, D\. Foss, D\. Wang, D\. Le, D\. Holland, E\. Dowling, E\. Jamil, E\. Montgomery, E\. Presani, E\. Hahn, E\. Wood, E\. Le, E\. Brinkman, E\. Arcaute, E\. Dunbar, E\. Smothers, F\. Sun, F\. Kreuk, F\. Tian, F\. Kokkinos, F\. 
Ozgenel, F\. Caggioni, F\. Kanayet, F\. Seide, G\. M\. Florez, G\. Schwarz, G\. Badeer, G\. Swee, G\. Halpern, G\. Herman, G\. Sizov, Guangyi, Zhang, G\. Lakshminarayanan, H\. Inan, H\. Shojanazeri, H\. Zou, H\. Wang, H\. Zha, H\. Habeeb, H\. Rudolph, H\. Suk, H\. Aspegren, H\. Goldman, H\. Zhan, I\. Damlaj, I\. Molybog, I\. Tufanov, I\. Leontiadis, I\. Veliche, I\. Gat, J\. Weissman, J\. Geboski, J\. Kohli, J\. Lam, J\. Asher, J\. Gaya, J\. Marcus, J\. Tang, J\. Chan, J\. Zhen, J\. Reizenstein, J\. Teboul, J\. Zhong, J\. Jin, J\. Yang, J\. Cummings, J\. Carvill, J\. Shepard, J\. McPhie, J\. Torres, J\. Ginsburg, J\. Wang, K\. Wu, K\. H\. U, K\. Saxena, K\. Khandelwal, K\. Zand, K\. Matosich, K\. Veeraraghavan, K\. Michelena, K\. Li, K\. Jagadeesh, K\. Huang, K\. Chawla, K\. Huang, L\. Chen, L\. Garg, L\. A, L\. Silva, L\. Bell, L\. Zhang, L\. Guo, L\. Yu, L\. Moshkovich, L\. Wehrstedt, M\. Khabsa, M\. Avalani, M\. Bhatt, M\. Mankus, M\. Hasson, M\. Lennie, M\. Reso, M\. Groshev, M\. Naumov, M\. Lathi, M\. Keneally, M\. Liu, M\. L\. Seltzer, M\. Valko, M\. Restrepo, M\. Patel, M\. Vyatskov, M\. Samvelyan, M\. Clark, M\. Macey, M\. Wang, M\. J\. Hermoso, M\. Metanat, M\. Rastegari, M\. Bansal, N\. Santhanam, N\. Parks, N\. White, N\. Bawa, N\. Singhal, N\. Egebo, N\. Usunier, N\. Mehta, N\. P\. Laptev, N\. Dong, N\. Cheng, O\. Chernoguz, O\. Hart, O\. Salpekar, O\. Kalinli, P\. Kent, P\. Parekh, P\. Saab, P\. Balaji, P\. Rittner, P\. Bontrager, P\. Roux, P\. Dollar, P\. Zvyagina, P\. Ratanchandani, P\. Yuvraj, Q\. Liang, R\. Alao, R\. Rodriguez, R\. Ayub, R\. Murthy, R\. Nayani, R\. Mitra, R\. Parthasarathy, R\. Li, R\. Hogan, R\. Battey, R\. Wang, R\. Howes, R\. Rinott, S\. Mehta, S\. Siby, S\. J\. Bondu, S\. Datta, S\. Chugh, S\. Hunt, S\. Dhillon, S\. Sidorov, S\. Pan, S\. Mahajan, S\. Verma, S\. Yamamoto, S\. Ramaswamy, S\. Lindsay, S\. Lindsay, S\. Feng, S\. Lin, S\. C\. Zha, S\. Patil, S\. Shankar, S\. Zhang, S\. Zhang, S\. Wang, S\. Agarwal, S\. 
Sajuyigbe, S. Chintala, S. Max, S. Chen, S. Kehoe, S. Satterfield, S. Govindaprasad, S. Gupta, S. Deng, S. Cho, S. Virk, S. Subramanian, S. Choudhury, S. Goldman, T. Remez, T. Glaser, T. Best, T. Koehler, T. Robinson, T. Li, T. Zhang, T. Matthews, T. Chou, T. Shaked, V. Vontimitta, V. Ajayi, V. Montanez, V. Mohan, V. S. Kumar, V. Mangla, V. Ionescu, V. Poenaru, V. T. Mihailescu, V. Ivanov, W. Li, W. Wang, W. Jiang, W. Bouaziz, W. Constable, X. Tang, X. Wu, X. Wang, X. Wu, X. Gao, Y. Kleinman, Y. Chen, Y. Hu, Y. Jia, Y. Qi, Y. Li, Y. Zhang, Y. Zhang, Y. Adi, Y. Nam, Yu, Wang, Y. Zhao, Y. Hao, Y. Qian, Y. Li, Y. He, Z. Rait, Z. DeVito, Z. Rosnbrick, Z. Wen, Z. Yang, Z. Zhao, and Z. Ma (2024) The Llama 3 herd of models. External Links: 2407.21783, [Link](https://arxiv.org/abs/2407.21783). Cited by: [§3.2](https://arxiv.org/html/2606.05180#S3.SS2.SSS0.Px2.p1.2).
- J. Hattie and H. Timperley (2007) The power of feedback. Review of Educational Research 77(1), pp. 81–112. External Links: ISSN 00346543, 19351046, [Link](http://www.jstor.org/stable/4624888). Cited by: [§3.1](https://arxiv.org/html/2606.05180#S3.SS1.p3.1).
- P. He, J. Gao, and W. Chen (2023) DeBERTaV3: improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. External Links: 2111.09543, [Link](https://arxiv.org/abs/2111.09543). Cited by: [§3.2](https://arxiv.org/html/2606.05180#S3.SS2.SSS0.Px1.p1.1).
- W. Holmes, K. Porayska-Pomsta, K. Holstein, E. Sutherland, T. Baker, S. B. Shum, O. C. Santos, M. T. Rodrigo, M. Cukurova, I. I. Bittencourt, and K. R. Koedinger (2022) Ethics of AI in education: towards a community-wide framework. International Journal of Artificial Intelligence in Education 32(3), pp. 504–526. External Links: [Document](https://dx.doi.org/10.1007/s40593-021-00239-1). Cited by: [Ethical considerations](https://arxiv.org/html/2606.05180#Sx2.p2.1).
- R. Hou, B. Bühler, T. Fütterer, E. Bozkir, P. Gerjets, U. Trautwein, and E. Kasneci (2025a) Multimodal assessment of classroom discourse quality: a text-centered attention-based multi-task learning approach. arXiv preprint arXiv:2505.07902. Cited by: [§2.3](https://arxiv.org/html/2606.05180#S2.SS3.p1.1).
- R. Hou, T. Fütterer, B. Bühler, E. Bozkir, P. Gerjets, U. Trautwein, and E. Kasneci (2024) Automated assessment of encouragement and warmth in classrooms leveraging multimodal emotional features and ChatGPT. In Artificial Intelligence in Education, A. M. Olney, I. Chounta, Z. Liu, O. C. Santos, and I. I. Bittencourt (Eds.), Cham, pp. 60–74. External Links: ISBN 978-3-031-64302-6. Cited by: [§1](https://arxiv.org/html/2606.05180#S1.p3.1), [§2.3](https://arxiv.org/html/2606.05180#S2.SS3.p1.1).
- R. Hou, T. Fütterer, B. Bühler, P. Schreyer, P. Gerjets, U. Trautwein, and E. Kasneci (2025b) LLM-human alignment in evaluating teacher questioning practices: beyond ratings to explanation. In Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers, J. Wilson, C. Ormerod, and M. Beiting Parrish (Eds.), Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States, pp. 239–249. External Links: [Link](https://aclanthology.org/2025.aimecon-main.26/), ISBN 979-8-218-84228-4. Cited by: [§2.3](https://arxiv.org/html/2606.05180#S2.SS3.p1.1).
- A. Jacovi and Y. Goldberg (2020) Towards faithfully interpretable NLP systems: how should we define and evaluate faithfulness?. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault (Eds.), Online, pp. 4198–4205. External Links: [Link](https://aclanthology.org/2020.acl-main.386/), [Document](https://dx.doi.org/10.18653/v1/2020.acl-main.386). Cited by: [§2.1](https://arxiv.org/html/2606.05180#S2.SS1.p2.1).
- S. Jain and B. C. Wallace (2019) Attention is not Explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), Minneapolis, Minnesota, pp. 3543–3556. External Links: [Link](https://aclanthology.org/N19-1357/), [Document](https://dx.doi.org/10.18653/v1/N19-1357). Cited by: [§2.1](https://arxiv.org/html/2606.05180#S2.SS1.p2.1).
- A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. de las Casas, F\. Bressand, G\. Lengyel, G\. Lample, L\. Saulnier, L\. R\. Lavaud, M\. Lachaux, P\. Stock, T\. L\. Scao, T\. Lavril, T\. Wang, T\. Lacroix, and W\. E\. Sayed \(2023\)Mistral 7b\.External Links:2310\.06825,[Link](https://arxiv.org/abs/2310.06825)Cited by:[§3\.2](https://arxiv.org/html/2606.05180#S3.SS2.SSS0.Px2.p1.2)\.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Roux, A\. Mensch, B\. Savary, C\. Bamford, D\. S\. Chaplot, D\. de las Casas, E\. B\. Hanna, F\. Bressand, G\. Lengyel, G\. Bour, G\. Lample, L\. R\. Lavaud, L\. Saulnier, M\. Lachaux, P\. Stock, S\. Subramanian, S\. Yang, S\. Antoniak, T\. L\. Scao, T\. Gervet, T\. Lavril, T\. Wang, T\. Lacroix, and W\. E\. Sayed \(2024\)Mixtral of experts\.External Links:2401\.04088,[Link](https://arxiv.org/abs/2401.04088)Cited by:[§3\.2](https://arxiv.org/html/2606.05180#S3.SS2.SSS0.Px2.p1.2)\.
- A\. Khediri, H\. Slimi, A\. Yahiaoui, M\. Derdour, H\. Bendjenna, and C\. E\. Ghenai \(2024\)Enhancing machine learning model interpretability in intrusion detection systems through shap explanations and llm\-generated descriptions\.In2024 6th International Conference on Pattern Analysis and Intelligent Systems \(PAIS\),Vol\.,pp\. 1–6\.External Links:[Document](https://dx.doi.org/10.1109/PAIS62114.2024.10541168)Cited by:[§2\.1](https://arxiv.org/html/2606.05180#S2.SS1.p1.1)\.
- V\. Kumar and D\. Boulanger \(2020\)Explainable automated essay scoring: deep learning really has pedagogical value\.Frontiers in EducationVolume 5 \- 2020\.External Links:[Link](https://www.frontiersin.org/journals/education/articles/10.3389/feduc.2020.572367),[Document](https://dx.doi.org/10.3389/feduc.2020.572367),ISSN 2504\-284XCited by:[§2\.1](https://arxiv.org/html/2606.05180#S2.SS1.p3.1)\.
- Z\. Lan, M\. Chen, S\. Goodman, K\. Gimpel, P\. Sharma, and R\. Soricut \(2020\)ALBERT: a lite bert for self\-supervised learning of language representations\.External Links:1909\.11942,[Link](https://arxiv.org/abs/1909.11942)Cited by:[§3\.2](https://arxiv.org/html/2606.05180#S3.SS2.SSS0.Px1.p1.1)\.
- Y\. Liu, M\. Ott, N\. Goyal, J\. Du, M\. Joshi, D\. Chen, O\. Levy, M\. Lewis, L\. Zettlemoyer, and V\. Stoyanov \(2019\)RoBERTa: a robustly optimized bert pretraining approach\.External Links:1907\.11692,[Link](https://arxiv.org/abs/1907.11692)Cited by:[§3\.2](https://arxiv.org/html/2606.05180#S3.SS2.SSS0.Px1.p1.1)\.
- S\. M\. Lundberg and S\. Lee \(2017\)A unified approach to interpreting model predictions\.InProceedings of the 31st International Conference on Neural Information Processing Systems,NIPS’17,Red Hook, NY, USA,pp\. 4768–4777\.External Links:ISBN 9781510860964Cited by:[§1](https://arxiv.org/html/2606.05180#S1.p4.1),[§2\.1](https://arxiv.org/html/2606.05180#S2.SS1.p1.1)\.
- N\. Mehrabi, F\. Morstatter, N\. Saxena, K\. Lerman, and A\. Galstyan \(2021\)A survey on bias and fairness in machine learning\.ACM Comput\. Surv\.54\(6\)\.External Links:ISSN 0360\-0300,[Link](https://doi.org/10.1145/3457607),[Document](https://dx.doi.org/10.1145/3457607)Cited by:[Ethical considerations](https://arxiv.org/html/2606.05180#Sx2.p1.1)\.
- J\. Meyer, T\. Jansen, R\. Schiller, L\. W\. Liebenow, M\. Steinbach, A\. Horbach, and J\. Fleckenstein \(2024\)Using llms to bring evidence\-based feedback into the classroom: ai\-generated feedback increases secondary students’ text revision, motivation, and positive emotions\.Computers and Education: Artificial Intelligence6,pp\. 100199\.External Links:ISSN 2666\-920X,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.caeai.2023.100199),[Link](https://www.sciencedirect.com/science/article/pii/S2666920X23000784)Cited by:[§2\.3](https://arxiv.org/html/2606.05180#S2.SS3.p2.1)\.
- B\. Mohammadi \(2024\)Explaining Large Language Models Decisions Using Shapley Values\.arXiv\.Note:arXiv:2404\.01332 \[cs\]External Links:[Link](http://arxiv.org/abs/2404.01332),[Document](https://dx.doi.org/10.48550/arXiv.2404.01332)Cited by:[§2\.2](https://arxiv.org/html/2606.05180#S2.SS2.p2.1)\.
- A\. Nguyen, H\. N\. Ngo, Y\. Hong, B\. Dang, and B\. T\. Nguyen \(2023\)Ethical principles for artificial intelligence in education\.Education and Information Technologies28\(4\),pp\. 4221–4241\.External Links:[Document](https://dx.doi.org/10.1007/s10639-022-11316-w)Cited by:[Ethical considerations](https://arxiv.org/html/2606.05180#Sx2.p1.1),[Ethical considerations](https://arxiv.org/html/2606.05180#Sx2.p2.1)\.
- R\. C\. Pianta, K\. M\. L\. Paro, and B\. K\. Hamre \(2008\)Classroom assessment scoring system™: manual k\-3\.\.Brookes Publishing\.External Links:[Link](https://api.semanticscholar.org/CorpusID:69768532)Cited by:[§1](https://arxiv.org/html/2606.05180#S1.p4.1),[§3\.1](https://arxiv.org/html/2606.05180#S3.SS1.p1.1)\.
- M\. T\. Ribeiro, S\. Singh, and C\. Guestrin \(2016\)"Why should i trust you?": explaining the predictions of any classifier\.External Links:1602\.04938,[Link](https://arxiv.org/abs/1602.04938)Cited by:[§2\.1](https://arxiv.org/html/2606.05180#S2.SS1.p1.1)\.
- K\. Seßler, M\. Fürstenberg, B\. Bühler, and E\. Kasneci \(2025\)Can ai grade your essays? a comparative analysis of large language models and teacher ratings in multidimensional essay scoring\.InProceedings of the 15th International Learning Analytics and Knowledge Conference,pp\. 462–472\.Cited by:[§2\.3](https://arxiv.org/html/2606.05180#S2.SS3.p2.1)\.
- N\. Tran, B\. Pierce, D\. Litman, R\. Correnti, and L\. C\. Matsumura \(2023\)Utilizing natural language processing for automated assessment of classroom discussion\.InArtificial Intelligence in Education\. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium and Blue Sky,N\. Wang, G\. Rebolledo\-Mendez, V\. Dimitrova, N\. Matsuda, and O\. C\. Santos \(Eds\.\),Cham,pp\. 490–496\.External Links:ISBN 978\-3\-031\-36336\-8Cited by:[§2\.3](https://arxiv.org/html/2606.05180#S2.SS3.p2.1)\.
- M\. Turpin, J\. Michael, E\. Perez, and S\. R\. Bowman \(2023\)Language models don’t always say what they think: unfaithful explanations in chain\-of\-thought prompting\.InProceedings of the 37th International Conference on Neural Information Processing Systems,NIPS ’23,Red Hook, NY, USA\.Cited by:[§1](https://arxiv.org/html/2606.05180#S1.p1.1),[§1](https://arxiv.org/html/2606.05180#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.05180#S2.SS2.p1.1)\.
- R\. Wang and D\. Demszky \(2023\)Is ChatGPT a good teacher coach? measuring zero\-shot performance for scoring and providing actionable insights on classroom instruction\.InProceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2023\),E\. Kochmar, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, N\. Madnani, A\. Tack, V\. Yaneva, Z\. Yuan, and T\. Zesch \(Eds\.\),Toronto, Canada,pp\. 626–667\.External Links:[Link](https://aclanthology.org/2023.bea-1.53/),[Document](https://dx.doi.org/10.18653/v1/2023.bea-1.53)Cited by:[§2\.3](https://arxiv.org/html/2606.05180#S2.SS3.p1.1)\.
- S\. Wiegreffe and Y\. Pinter \(2019\)Attention is not not explanation\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\),K\. Inui, J\. Jiang, V\. Ng, and X\. Wan \(Eds\.\),Hong Kong, China,pp\. 11–20\.External Links:[Link](https://aclanthology.org/D19-1002/),[Document](https://dx.doi.org/10.18653/v1/D19-1002)Cited by:[§2\.1](https://arxiv.org/html/2606.05180#S2.SS1.p2.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§3\.2](https://arxiv.org/html/2606.05180#S3.SS2.SSS0.Px2.p1.2)\.
- X\. Ye and G\. Durrett \(2022\)The unreliability of explanations in few\-shot prompting for textual reasoning\.InProceedings of the 36th International Conference on Neural Information Processing Systems,NIPS ’22,Red Hook, NY, USA\.External Links:ISBN 9781713871088Cited by:[§1](https://arxiv.org/html/2606.05180#S1.p1.1),[§1](https://arxiv.org/html/2606.05180#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.05180#S2.SS2.p1.1)\.

## Appendix A Training Hyperparameters

The model was fine-tuned using the Hugging Face `Trainer` API with the following key hyperparameters:

- Learning rate: $1 \times 10^{-5}$
- Batch size: 1 per device (with gradient accumulation over 8 steps)
- Number of epochs: 10
- Weight decay: 0.01
- Maximum gradient norm: 1.0
- Evaluation strategy: per epoch
- Saving strategy: per epoch (keeping the best model based on validation MAE)
- Learning rate scheduler: cosine schedule
- Warmup ratio: 0.1
- Mixed precision: FP16 when available
- Early stopping: patience of 3 validation checks

All other hyperparameters were kept at their default values. A single H200 GPU was used, and computation took approximately 322 hours in total (about 20 hours for PLMs and about 302 hours for LLMs). Because we used quantized versions of the models, all experiments could have been run on a single A100 GPU, except for those with the LLMs Qwen3 235B A22B Instruct and Mixtral 8x22B Instruct, which require more than 80 GB of VRAM even in their 4-bit `nf4` quantized versions.
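To make the interplay between batch size, gradient accumulation, warmup ratio, and the cosine schedule concrete, the following is a minimal pure-Python sketch. It assumes the common formulation of a cosine schedule with linear warmup (a linear ramp from zero to the peak learning rate over the first 10% of optimizer steps, then a half-cosine decay to zero). The dictionary keys, the function names `effective_batch_size` and `lr_at_step`, and the step count in the usage note are illustrative and are not taken from the authors' code.

```python
import math

# Values from the hyperparameter list above; the key names are illustrative.
HPARAMS = {
    "learning_rate": 1e-5,
    "per_device_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "num_epochs": 10,
    "weight_decay": 0.01,
    "max_grad_norm": 1.0,
    "warmup_ratio": 0.1,
    "early_stopping_patience": 3,
}


def effective_batch_size(hp, num_devices=1):
    """Number of examples that contribute to each optimizer update."""
    return hp["per_device_batch_size"] * hp["gradient_accumulation_steps"] * num_devices


def lr_at_step(step, total_steps, hp):
    """Learning rate under linear warmup followed by cosine decay to zero (assumed form)."""
    warmup_steps = int(hp["warmup_ratio"] * total_steps)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return hp["learning_rate"] * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return hp["learning_rate"] * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With a hypothetical run of 1,000 optimizer steps, `lr_at_step(100, 1000, HPARAMS)` returns the peak rate of 1e-5 at the end of warmup, and `effective_batch_size(HPARAMS)` gives 8 examples per update on one device.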

## Appendix B Prompts

Fig. [4](https://arxiv.org/html/2606.05180#A2.F4) shows an example of the few-shot prompt for Quality of Feedback scoring of a transcript segment, and Fig. [5](https://arxiv.org/html/2606.05180#A2.F5) shows the zero-shot prompt for sentence ranking.

> You are a rater for classroom transcripts. You are tasked with evaluating the transcripts to rate the dimension ’Quality of Feedback’ as one of the dimensions of the domain ’Instructional Support’ according to the Classroom Assessment Scoring System (CLASS).
>
> Your goal is to rate the provided transcript by focusing on the teacher’s interactions labeled as "Teacher" and the students’ responses labeled as "Student".
>
> \*\*Your Task:\*\*
>
> Please read the following classroom transcript and assign a rating based on the Quality of Feedback dimension.
>
> Quality of Feedback assesses the degree to which feedback expands and extends learning and understanding and encourages student participation. In upper elementary classrooms, significant feedback may also be provided by peers. Regardless of the source, the focus here should be on the nature of the feedback provided and the extent to which it “pushes” learning.
>
> Here are examples for the different score values in the Quality of Feedback dimension:
>
> An Example for score value 1 is {Transcript segment from training set with score 1}
>
> An Example for score value 2 is {Transcript segment from training set with score 2}
>
> An Example for score value 3 is {Transcript segment from training set with score 3}
>
> An Example for score value 4 is {Transcript segment from training set with score 4}
>
> An Example for score value 5 is {Transcript segment from training set with score 5}
>
> An Example for score value 6 is {Transcript segment from training set with score 6}
>
> An Example for score value 7 is {Transcript segment from training set with score 7}
>
> You must begin your answer with
>
> ‘\#\#\# Rating: \<the score on a 1-7 integer scale here\>’
>
> \#\#\# Transcript:
>
> {Transcript segment to score}
>
> \#\#\# Rating:

Figure 4: Prompts used for few-shot prompting of LLMs for the task of Quality of Feedback scoring of a classroom transcript segment. {Transcript segment from training set with score 1-7} represents an actual example from the training set, where the specific score was given to the transcript by an expert annotator, and {Transcript segment to score} represents the transcript segment to be scored.

> You are an expert in classroom discourse analysis and the CLASS (Classroom Assessment Scoring System) framework. You will receive:
>
> 1\. A transcript of a classroom interaction.
>
> 2\. The same transcript divided by numbered sentences, formatted as:
>
> (1) - \<SENTENCE \#1\>
>
> (2) - \<SENTENCE \#2\>
>
> …
>
> 3\. A score (1-7) for the "Quality of Feedback" domain according to the CLASS framework, annotated by an expert. 1 indicates very low quality feedback, while 7 indicates very high quality feedback.
>
> Your task:
>
> \* Identify the sentences (by their numbers) that most strongly influenced this score.
>
> \* Focus specifically on aspects relevant to the "Quality of Feedback" domain (e.g., scaffolding, prompting, encouragement of thought processes, questioning strategies, or absence of these features).
>
> \* Return exactly 10 sentence numbers, ordered by estimated importance (most influential first).
>
> \* If fewer than 10 sentences exist, return only those available.
>
> Format your response \*\*strictly\*\* as a single line of comma-separated numbers (no spaces, no brackets, no explanations). For example: 9,15,6,37,2,7,8,1,64,66
>
> Do not include any commentary, reasoning, or additional text.
>
> You must begin your answer with
>
> ‘\#\#\# Output: \<your comma-separated list\>’
>
> 1\. Transcript:
>
> {Full transcript segment}
>
> 2\. Numbered Sentences:
>
> (1) - {First sentence}
>
> (2) - {Second sentence}
>
> (3) - {Third sentence}
>
> …
>
> 3\. Score:
>
> {Previously predicted score}
>
> \#\#\# Output:

Figure 5: Zero-shot prompt for LLM sentence ranking. {Full transcript segment} represents the full transcript segment, with the same formatting as it appears in the dataset, {First/Second/Third/… sentence} represents the transcript split into sentences and numbered by order of appearance, and {Previously predicted score} represents the Quality of Feedback score predicted for the transcript segment either by the LLMs or by the PLMs.

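Both prompts fix an output format (`### Rating: <score>` and `### Output: <comma-separated list>`), so model responses can be handled with simple string processing. The sketch below shows one way to build the numbered-sentence block and to parse the two response formats. The function names and the handling of malformed output (dropping duplicates and out-of-range numbers, returning `None` or an empty list when the marker is missing) are assumptions for illustration; this appendix does not describe the authors' own parsing code.

```python
import re


def format_numbered_sentences(sentences):
    """Render sentences in the '(i) - sentence' layout used by the ranking prompt."""
    return "\n".join(f"({i}) - {s}" for i, s in enumerate(sentences, start=1))


def parse_rating(response):
    """Return the integer 1-7 that follows '### Rating:', or None if it is absent."""
    match = re.search(r"###\s*Rating:\s*([1-7])\b", response)
    return int(match.group(1)) if match else None


def parse_ranking(response, num_sentences, k=10):
    """Return up to k distinct, in-range sentence numbers that follow '### Output:'."""
    match = re.search(r"###\s*Output:\s*([\d,\s]+)", response)
    if not match:
        return []
    ranking = []
    for token in match.group(1).split(","):
        token = token.strip()
        if token.isdigit():
            idx = int(token)
            # Keep only valid sentence numbers, preserving the model's order.
            if 1 <= idx <= num_sentences and idx not in ranking:
                ranking.append(idx)
    return ranking[:k]
```

For example, `parse_ranking("### Output: 9,15,6,37,2", num_sentences=20)` keeps 9, 15, 6, and 2 and discards 37, which does not refer to an existing sentence in a 20-sentence segment.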
## Appendix C Prediction Statistics

Table [4](https://arxiv.org/html/2606.05180#A3.T4) presents the full prediction statistics for PLMs and LLMs: the mean and standard deviation of each model's predictions, and any labels that the model never predicted.

| Model | Mean | Std. Dev. | Labels Missing |
|---|---|---|---|
| Data (Training and Test) | 4.22 | 1.14 | – |
| ALBERT base | 4.10 | 0.28 | 1, 2, 6, 7 |
| ALBERT large | 4.33 | 0.41 | 1, 2, 7 |
| RoBERTa base | 4.27 | 0.22 | 1, 2, 6, 7 |
| RoBERTa large | 4.01 | 0.11 | 1, 2, 6, 7 |
| BERT base | 4.15 | 0.38 | 1, 6, 7 |
| BERT large | 4.14 | 0.41 | 1, 2, 7 |
| DeBERTa V3 base | 4.49 | 0.47 | 1, 7 |
| DeBERTa V3 large | 4.14 | 0.16 | 1, 2, 6, 7 |
| Llama 3.1 8B Instruct | 4.91 | 1.60 | – |
| Llama 3.1 70B Instruct | 2.95 | 1.65 | – |
| Mistral Small Instruct | 4.37 | 0.77 | 1, 7 |
| Mistral Small 24B Instruct | 2.55 | 0.69 | 6, 7 |
| Mixtral 8x7B Instruct | 3.02 | 0.52 | 6, 7 |
| Mixtral 8x22B Instruct | 4.30 | 1.19 | – |
| Qwen3 4B Instruct | 3.96 | 1.12 | – |
| Qwen3 30B A3B Instruct | 2.92 | 1.00 | 6, 7 |
| Qwen3 235B A22B Instruct | 2.80 | 1.13 | 7 |

Table 4: Prediction statistics for fine-tuned PLMs and LLMs. We report the mean and standard deviation of predicted QoF scores on the test set, as well as the CLASS labels (1–7) that were never predicted by each model.
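The three statistics in Table 4 are simple to compute from a list of predicted scores. The sketch below is illustrative only: it assumes the population standard deviation and assumes that continuous model outputs are rounded to the nearest integer before checking which labels are missing, neither of which is specified in this appendix. The function name `prediction_stats` is hypothetical.

```python
from statistics import mean, pstdev

CLASS_LABELS = range(1, 8)  # CLASS scores run from 1 to 7


def prediction_stats(predictions):
    """Return (mean, standard deviation, labels never predicted after rounding)."""
    rounded = {round(p) for p in predictions}
    missing = [label for label in CLASS_LABELS if label not in rounded]
    return mean(predictions), pstdev(predictions), missing
```

A small standard deviation combined with missing labels at both ends of the scale, as in most PLM rows of Table 4, is the numerical signature of predictions concentrating around the mid-scale scores.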
## Appendix D Average Number of Sentences Predicted

Table [5](https://arxiv.org/html/2606.05180#A4.T5) reports the average number of sentences output by each LLM, together with the difference ($\Delta$), computed as expected minus observed, from the expected average of 9.931 sentences.

| Model | # of sentences | $\Delta$ |
|---|---|---|
| Llama 3.1 8B Instruct | 9.939 | -0.008 |
| Llama 3.1 70B Instruct | 9.886 | 0.045 |
| Mistral Small Instruct | 9.654 | 0.277 |
| Mistral Small 24B Instruct | 9.871 | 0.060 |
| Mixtral 8x7B Instruct | 9.291 | 0.640 |
| Mixtral 8x22B Instruct | 9.836 | 0.095 |
| Qwen3 4B Instruct | 9.711 | 0.220 |
| Qwen3 30B A3B Instruct | 9.828 | 0.102 |
| Qwen3 235B A22B Instruct | 9.516 | 0.415 |

Table 5: Average number of sentences each model outputs for sentence ranking, and the difference between the expected and observed number of sentences.
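One plausible reading of the expected average (an assumption, since this appendix does not state how 9.931 was derived) is that it falls slightly below 10 because the ranking prompt asks for only the available sentences when a segment has fewer than 10. Under that reading, the expected count and the tabulated difference could be computed as in the sketch below; the function names and the segment lengths in the usage note are hypothetical.

```python
def expected_sentence_count(segment_lengths, k=10):
    """Average number of sentences a model should return when asked for the top k."""
    return sum(min(k, n) for n in segment_lengths) / len(segment_lengths)


def delta(expected, observed):
    """Signed gap: positive when a model returns fewer sentences than expected."""
    return expected - observed
```

For four hypothetical segments with 12, 30, 8, and 10 sentences, `expected_sentence_count([12, 30, 8, 10])` is 9.5. Applied to the first row of Table 5, `delta(9.931, 9.939)` rounds to -0.008, meaning that the model returned slightly more sentences than expected on average.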
## Appendix E Additional Details on Explanation Alignment

Table [6](https://arxiv.org/html/2606.05180#A5.T6) reports the alignment between SHAP-based and LLM-based explanations across 1,230 transcripts. Jaccard similarity is computed between the sets of the top-10 sentences identified by each method for every transcript. Spearman rank correlation is computed between sentence importance rankings derived from absolute SHAP values and LLM deletion order. Spearman correlations are reported only for transcripts where the correlation is well-defined, resulting in between 1,226 and 1,228 transcripts per model. Multiple SHAP runs are aggregated by averaging absolute SHAP values per sentence prior to ranking.

| Model | Jaccard@10 | Spearman $\rho$ |
|---|---|---|
| Llama 3.1 8B Instruct | 0.081 ± 0.135 | 0.048 ± 0.134 |
| Llama 3.1 70B Instruct | 0.089 ± 0.136 | 0.067 ± 0.129 |
| Mistral Small Instruct | 0.081 ± 0.128 | 0.054 ± 0.129 |
| Mistral Small 24B Instruct | 0.090 ± 0.141 | 0.082 ± 0.138 |
| Mixtral 8x7B Instruct | 0.072 ± 0.131 | 0.029 ± 0.125 |
| Mixtral 8x22B Instruct | 0.088 ± 0.142 | 0.070 ± 0.134 |
| Qwen3 4B Instruct | 0.084 ± 0.139 | 0.057 ± 0.138 |
| Qwen3 30B A3B Instruct | 0.089 ± 0.154 | 0.071 ± 0.134 |
| Qwen3 235B A22B Instruct | 0.095 ± 0.159 | 0.081 ± 0.142 |
| Average | **0.085** | **0.062** |

Table 6: Alignment between SHAP and LLM explanations measured by Jaccard similarity over top-10 sentences and Spearman rank correlation. Values are reported as mean ± standard deviation.
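The alignment measures described in this appendix can be sketched in a few lines of dependency-free Python. This is an illustration and not the authors' implementation: Spearman's $\rho$ is computed as the Pearson correlation of average ranks, `None` is returned when a ranking is constant (one case in which the correlation is not well-defined), and both methods are assumed to provide an importance score for the same set of sentences. How the LLM deletion order is mapped onto such scores is not specified here, and all function names are hypothetical.

```python
from statistics import mean


def jaccard(a, b):
    """Jaccard similarity between two collections of sentence indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_ranks(values):
    """1-based ascending ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho as the Pearson correlation of ranks; None when undefined."""
    if len(x) != len(y) or len(x) < 2:
        return None
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None  # a constant ranking has no defined correlation
    return cov / (var_x * var_y) ** 0.5


def mean_abs_shap(runs):
    """Average |SHAP| per sentence across runs (one list of values per run)."""
    return [mean(abs(v) for v in per_sentence) for per_sentence in zip(*runs)]
```

For instance, `jaccard([1, 2, 3], [2, 3, 4])` is 0.5 because two of the four distinct sentences are shared. The averages in Table 6 (0.085 and 0.062) therefore correspond to top-10 sets that overlap very little and to rankings that are nearly uncorrelated.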
## Appendix F Results of Sentence Removal

Table [7](https://arxiv.org/html/2606.05180#A6.T7) shows the cumulative delta values after each step of removing the top 10 sentences. Fig. [6](https://arxiv.org/html/2606.05180#A6.F6) presents the same results graphically.

| Model / Number of Sentences Removed | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| **PLMs** | | | | | | | | | | |
| ALBERT base | 0.051 | 0.085 | 0.115 | 0.140 | 0.157 | 0.173 | 0.185 | 0.198 | 0.210 | 0.219 |
| ALBERT large | 0.024 | 0.044 | 0.062 | 0.080 | 0.097 | 0.114 | 0.129 | 0.144 | 0.160 | 0.172 |
| RoBERTa base | 0.007 | 0.014 | 0.019 | 0.022 | 0.028 | 0.033 | 0.040 | 0.044 | 0.048 | 0.053 |
| BERT base | 0.068 | 0.103 | 0.134 | 0.158 | 0.180 | 0.196 | 0.213 | 0.229 | 0.242 | 0.256 |
| BERT large | 0.080 | 0.135 | 0.178 | 0.210 | 0.237 | 0.261 | 0.282 | 0.300 | 0.315 | 0.329 |
| DeBERTaV3 base | 0.048 | 0.079 | 0.104 | 0.135 | 0.156 | 0.176 | 0.192 | 0.214 | 0.229 | 0.242 |
| DeBERTaV3 large | 0.009 | 0.015 | 0.021 | 0.026 | 0.031 | 0.035 | 0.039 | 0.043 | 0.046 | 0.049 |
| **LLMs** | | | | | | | | | | |
| Llama 3.1 8B Instruct | -0.017 | 0.007 | 0.039 | 0.079 | 0.065 | 0.098 | 0.129 | 0.106 | 0.131 | 0.174 |
| Llama 3.1 70B Instruct | -0.041 | -0.006 | 0.016 | 0.035 | 0.065 | 0.072 | 0.054 | 0.090 | 0.094 | 0.090 |
| Mistral Small Instruct | 0.007 | 0.021 | 0.017 | 0.013 | 0.012 | 0.011 | 0.022 | 0.007 | 0.012 | 0.033 |
| Mistral Small 24B Instruct | -0.001 | 0.022 | 0.035 | 0.041 | 0.047 | 0.064 | 0.077 | 0.100 | 0.101 | 0.123 |
| Mixtral 8x7B Instruct | -0.016 | -0.017 | -0.007 | 0.004 | 0.011 | 0.015 | 0.026 | 0.046 | 0.054 | 0.121 |
| Mixtral 8x22B Instruct | 0.001 | -0.029 | 0.014 | 0.041 | 0.049 | 0.060 | 0.072 | 0.115 | 0.153 | 0.199 |
| Qwen3 4B Instruct | -0.016 | 0.004 | 0.029 | 0.044 | 0.079 | 0.098 | 0.132 | 0.144 | 0.144 | 0.211 |
| Qwen3 30B A3B Instruct | 0.016 | 0.032 | 0.046 | 0.063 | 0.081 | 0.083 | 0.119 | 0.127 | 0.139 | 0.174 |
| Qwen3 235B A22B Instruct | 0.031 | 0.070 | 0.087 | 0.101 | 0.127 | 0.155 | 0.156 | 0.194 | 0.247 | 0.388 |

Table 7: Performance deltas after removing consecutive numbers of sentences. Top: PLMs. Bottom: LLMs.

![Refer to caption](https://arxiv.org/html/2606.05180v1/x4.png)

Figure 6: Average prediction changes after removing most important sentences and re-scoring within model family.
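The cumulative deletion procedure behind Table 7 can be illustrated with a short sketch: the top-ranked sentences are removed one at a time, the shortened transcript is re-scored, and the change relative to the original score is recorded. The sign convention (original score minus score after removal) and the toy scoring function are assumptions made for this example. In the actual experiments the scorer would be a fine-tuned PLM or a prompted LLM, and the tabulated values would be averages over transcripts.

```python
def cumulative_deltas(sentences, ranking, score_fn, max_removed=10):
    """Score change after removing the top-1, top-2, ... ranked sentences.

    `ranking` holds 1-based sentence numbers, most important first.
    Sign convention (an assumption): original score minus score after removal.
    """
    base = score_fn(sentences)
    removed = set()
    deltas = []
    for idx in ranking[:max_removed]:
        removed.add(idx)
        kept = [s for i, s in enumerate(sentences, start=1) if i not in removed]
        deltas.append(base - score_fn(kept))
    return deltas


def toy_score(sentences):
    """Stand-in scorer, not a real model: 1 + 6 * share of sentences that are questions."""
    if not sentences:
        return 1.0
    questions = sum(s.strip().endswith("?") for s in sentences)
    return 1.0 + 6.0 * questions / len(sentences)
```

With the hypothetical transcript `["What is 2 plus 9?", "Eleven.", "Why?", "Okay."]` and the ranking `[1, 3]`, `cumulative_deltas(..., toy_score)` returns `[1.0, 3.0]`: the toy score falls from 4.0 to 3.0 after the first removal and to 1.0 after the second.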
## Appendix G Removed Sentence Examples

Table [8](https://arxiv.org/html/2606.05180#A7.T8) lists randomly selected sentences removed during the sentence-deletion experiments. Sentences are grouped by the explanation method(s) that selected them.

| Model | Sentence |
|---|---|
| LLM | You think it’s not a polygon because it has a curved side. |
| | Teacher: Okay, what you do is, yeah, you cut it in half again so you have their equivalents. |
| | Remember in math I spent a lot of time going over with you every time you answer a question you have to a? |
| | Is this area right here inside the O? |
| | What’s ten times 32? |
| | What does she need to put right next to the inches? |
| | Student: Um - Teacher: Try it - try a number for X and see how well it works for you. |
| | Teacher: Add my two sums of my two grids. |
| | Technically if you’re finding area, you’re finding the number of squares inside the shape, correct? |
| | So are they different? |
| PLM | 1, 2, 3. |
| | Teacher: I could divide it by 2, yeah. |
| | Why don’t we count on the 9th? |
| | Tell somebody in your group what you learned in math today. |
| | What happens to 6? |
| | 635. |
| | Student: Can we write an answer sentence? |
| | Multiple Students: Label. |
| | Student: … |
| | Teacher: Plus 19 - Student D, what’s 2 plus 9? |
| Both | And now I can find my total which will be what, Student H? |
| | Student: It’s going to get bigger. |
| | Teacher: So what is the denominator there? |
| | As you’re doing this and you’re answering the questions I want to label it. |
| | So what is 12 divided by 2? |
| | Okay so 85 plus 15 would bring me to my next whole. |
| | What happens to 6? |
| | Teacher: You have to go ahead …. |
| | Multiple Students: Less. |
| | Student: It’s three times bigger. |

Table 8: Example sentences removed during sentence-deletion experiments, grouped by whether they were selected by LLM-based explanations, PLM-based explanations, or by both.
