Multimodal Speaker Identification in Classroom Environments
Summary
This paper evaluates a multimodal framework for speaker identification in K-12 classrooms by combining acoustic embeddings (ECAPA-TDNN) with LLM-derived semantic context from transcripts, improving accuracy from 39% to 50.3% overall and from 64.9% to 76.9% for longer utterances.
# Multimodal Speaker Identification in Classroom Environments
Source: [https://arxiv.org/html/2606.13712](https://arxiv.org/html/2606.13712)
Michael Leon Chrzan1, Meghavarshini Krishnaswamy1, Robert Gibboni2, Katie Wetstone2, Wei Ai3, and Jing Liu1
###### Abstract
Automated analysis of K-12 classroom dynamics faces challenges due to background noise and variable child speech, often confounding acoustic-only models. This study evaluates a multimodal speaker identification framework anchoring acoustic embeddings with LLM-derived semantic context. Using a subset of the EDSI dataset (8 math classrooms, N = 2,801 utterances), we found an acoustic baseline (ECAPA-TDNN) achieved only 39.0% accuracy. By integrating transcript-based “contextual anchoring” into a gradient boosting classifier, our multimodal approach raised student identification to 50.3%. Performance also improved for utterances over 5 seconds, reaching 76.9% accuracy (vs. 64.9% baseline) with a 90.9% Top-3 accuracy. Additionally, the model distinguished teacher vs. student roles with 99.3% accuracy. This approach advances the feasibility of automated feedback systems capable of considering individual student participation, a crucial step for supporting equitable instruction at scale.
## I Introduction
The automated analysis of classroom discourse has emerged as a pivotal frontier in educational technology, promising to provide scalable, objective feedback to teachers regarding instructional quality, student engagement, and equity of participation\[[28](https://arxiv.org/html/2606.13712#bib.bib1),[6](https://arxiv.org/html/2606.13712#bib.bib2),[7](https://arxiv.org/html/2606.13712#bib.bib3),[9](https://arxiv.org/html/2606.13712#bib.bib4),[5](https://arxiv.org/html/2606.13712#bib.bib5)\]\. Central to this analysis is the task of speaker identification \(SID\) which seeks to answer the fundamental questions of “who spoke when” and “who said what”\[[12](https://arxiv.org/html/2606.13712#bib.bib18)\]\. While traditional systems have relied heavily on acoustic biometrics, recent advancements in deep learning and Large Language Models \(LLMs\) suggest that a paradigm shift towards multimodal frameworks is useful for addressing the unique and adversarial acoustic conditions of the K\-12 classroom\.
The classroom environment presents a “perfect storm” of challenges for conventional audio processing: high levels of non\-stationary background noise, significant reverberation, and the distinctive, highly variable spectral characteristics of children’s speech\. Purely acoustic approaches frequently falter in these settings, struggling to distinguish between target speakers and the pervasive “babble noise” of peer discussions\[[23](https://arxiv.org/html/2606.13712#bib.bib19),[28](https://arxiv.org/html/2606.13712#bib.bib1),[25](https://arxiv.org/html/2606.13712#bib.bib17),[15](https://arxiv.org/html/2606.13712#bib.bib20)\]\.
While acoustic models provide physical evidence of identity, we propose an additional LLM Speaker Inference component, which introduces evidence derived from the semantic content of the speech\. The literature identifies this as a burgeoning field of “LLM\-adaptive diarization”, where language models are used to correct, refine, and attribute speaker turns based on context\[[26](https://arxiv.org/html/2606.13712#bib.bib7),[18](https://arxiv.org/html/2606.13712#bib.bib11),[10](https://arxiv.org/html/2606.13712#bib.bib6)\]\. This research focuses precisely on how pre\-existing transcripts can serve as input for LLMs to derive advanced stylistic features, which are then integrated with high\-fidelity acoustic embeddings\.
We work to answer the following question: what is the comparative performance gain achieved by supplementing audio embeddings with context\-rich textual features \(teacher cues and diarization labels\) versus a purely acoustic feature set, particularly concerning the identification of non\-teacher speakers? To that end, our study demonstrates a process for incorporating information from transcripts of classroom speech with acoustic models to improve speaker ID performance\.
Figure 1: Speaker Identification Uses
Speaker identification in this study served two main purposes, as displayed in this figure: (1) to adhere to our legal and ethical responsibilities towards students who did not consent to be part of the benchmark dataset, we needed a way to identify and anonymize their speech data; and (2) to track individual students throughout a session, enabling detailed research about classroom speaking patterns and interactions.
## II Related Work
Speaker identity is an inherently multimodal construct defined by two complementary biometric elements: the acoustic signature and the linguistic signature \[[14](https://arxiv.org/html/2606.13712#bib.bib8)\]. The acoustic signature relates to the physical and physiological characteristics of the vocal tract, determining pitch, timbre, and prosody. The linguistic signature, conversely, encapsulates the speaker’s acquired, high-level behavioral habits, including syntactic complexity, preferred vocabulary, discourse markers, and pragmatic strategies. These linguistic choices, often studied under the umbrella of computational sociolinguistics, are stable markers unique to an individual’s education, dialect, and social context.
Speaker Identification is a foundational task in speech processing, essential for biometric authentication, forensic analysis, and personalized conversational AI systems\[[10](https://arxiv.org/html/2606.13712#bib.bib6),[26](https://arxiv.org/html/2606.13712#bib.bib7)\]\. Historically dominated by acoustic signal processing, the field is undergoing a paradigm shift driven by the integration of Large Language Models \(LLMs\), enabling the system to leverage linguistic context alongside vocal characteristics\.
A speaker’s identity is encoded not only in *what* they say (lexical content) but also in *how* they structure the conversation and manage social interaction. LLMs are uniquely equipped to process these complex, context-sensitive linguistic phenomena. LLMs can infer identity through “contextual anchoring”. If a teacher nominates a student by name in a given utterance (“What do you think, Jason?”), the LLM can identify the addressee of the utterance, and predict that the subsequent speaker is highly likely to be “Jason”. Similarly, if the transcript contains “Mr. Smith, can you help me?”, the addressee is Mr. Smith (Teacher), the speaker is highly likely to be a student and the next speaker is likely to be Mr. Smith. In both these examples, the named addressee can also be eliminated from the list of potential speakers. This transforms the problem from unsupervised clustering to semi-supervised tracking, using Named Entity Recognition (NER) to create “anchors” that constrain the acoustic clustering algorithm \[[19](https://arxiv.org/html/2606.13712#bib.bib9)\]. Thus, LLMs can move beyond simple semantic tagging to identify the functional components of speech acts.
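The anchoring logic described above can be illustrated with a deliberately simple sketch. It swaps the LLM for plain string matching against a roster, so it is a stand-in for the idea rather than the method used in this study; the function names and the roster are hypothetical.

```python
import re

def named_addressees(text, roster):
    """Return roster names that are mentioned in an utterance."""
    return [name for name in roster
            if re.search(rf"\b{re.escape(name)}\b", text)]

def apply_anchor(candidates, addressees):
    """Use named addressees as anchors.

    A named addressee is removed from the candidate speakers of the
    current utterance and proposed as a likely speaker of the next turn.
    """
    current_candidates = [c for c in candidates if c not in addressees]
    likely_next = list(addressees)
    return current_candidates, likely_next

roster = ["Jason", "Maya", "Mr. Smith"]  # hypothetical names
addressees = named_addressees("What do you think, Jason?", roster)
current, upcoming = apply_anchor(roster, addressees)
```

Here `current` no longer contains the nominated student, and `upcoming` lists that student as the probable next speaker. An LLM can additionally resolve nicknames, pronouns, and indirect references, which simple matching cannot.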
Recent work has begun to explore multimodal approaches to address the acoustic challenges of the classroom\. For example, Perez et al\. \(TeachFX\) demonstrated that integrating LLM\-based re\-scoring with acoustic clustering significantly improved the distinction between teacher and student speech\. However, while their approach focuses on binary role classification \(Teacher vs\. Student\), our work extends this multimodal framework to the task of granular speaker identification\. We leverage ’contextual anchoring’ within the transcript not just to separate roles, but to distinguish individual students from one another—a significantly more complex task given the acoustic similarity among peer speakers\[[20](https://arxiv.org/html/2606.13712#bib.bib27)\]\.
In addition, previous work has shown the potential of automated feedback to be transformative for classroom practice\. That potential lies in its ability to offer a scalable, efficient, and objective alternative to traditional, resource\-intensive instructional measurement\. Historically, identifying complex instructional features or discourse strategies required time\-consuming and sometimes inconsistent manual analysis by expert raters\[[6](https://arxiv.org/html/2606.13712#bib.bib2),[18](https://arxiv.org/html/2606.13712#bib.bib11)\]\. However, the growing literature suggests that Natural Language Processing \(NLP\) techniques provide a transformative approach for instructional measurement\. These tools can automate the analysis of classroom transcripts, providing private, on\-demand feedback to teachers, which has been shown in some cases to lead to positive impacts on educators’ instruction quality and student outcomes\[[6](https://arxiv.org/html/2606.13712#bib.bib2),[5](https://arxiv.org/html/2606.13712#bib.bib5),[8](https://arxiv.org/html/2606.13712#bib.bib13),[7](https://arxiv.org/html/2606.13712#bib.bib3),[17](https://arxiv.org/html/2606.13712#bib.bib12)\]\. This efficiency allows for the development of feedback systems that are cost\-efficient and can be delivered quickly and frequently\. Crucial to the current and future effectiveness and expansiveness of that feedback is the ability to both accurately identify the teacher’s speech from the students’ and also individual students from one another\.
TABLE I: Summary of Teacher-Student Talk Time and Classroom Metrics. Note. s = seconds; Average columns contain standard deviations in parentheses.
## III Data
This study utilizes data from the Center for Educational Data Science and Innovation \(EDSI\) Dataset, a comprehensive educational repository developed at the University of Maryland–College Park\[[2](https://arxiv.org/html/2606.13712#bib.bib14)\]\. The EDSI dataset was constructed to advance the application of artificial intelligence \(AI\) in educational research and practice, specifically within the context of mathematics classrooms\. It integrates multimodal data sources to provide a holistic view of classroom dynamics, distinguishing itself through a rich combination of raw observational data and structured metadata\.
The core components of the dataset include high\-fidelity audio and video recordings of classroom interactions, accompanied by high\-quality transcriptions\. These observational data are linked with extensive metadata, including:
- •Demographics: detailed profiles of student and teacher characteristics\.
- •Classroom Artifacts: contextual materials such as seating charts\.
- •Metrics: student achievement data and psychometric survey responses from both students and teachers\.
Data collection for the EDSI dataset employs a multi-microphone setup to capture the complex auditory environment of active classrooms. The audio is recorded using Swivl robotic platforms equipped with five labeled microphones to track and record audio from different sources simultaneously \[[24](https://arxiv.org/html/2606.13712#bib.bib15)\]. For transcription purposes, these individual audio streams are combined into a single mixed-audio file. This composite file is then processed by TranscribeMe to generate professional-grade transcripts of classroom discourse, which include the transcribed speech, start and end times of utterances, and diarization labels \[[27](https://arxiv.org/html/2606.13712#bib.bib16)\].
For the present study, the analytical sample consists of recordings from eight different mathematics classrooms (8) selected from the broader EDSI corpus, as detailed in Table [I](https://arxiv.org/html/2606.13712#S2.T1). These classes were selected for their variation in student participation as well as the different mixture of background noise based on instructional choices (e.g., some teachers had a higher propensity for group work than others). For each of these eight classes, we utilized the mixed-audio file generated during the collection phase along with its corresponding transcript for model building. For establishing ground truth speaker identification, we use these files along with their corresponding video files and administrative data (such as seating charts, class rosters, etc.). While the raw dataset contained 3,999 utterances, the final labeled dataset for training was filtered to 2,801 utterances following the adjudication process described in Section [IV-A](https://arxiv.org/html/2606.13712#S4.SS1).
## IV Methods
We developed an automated speaker identification system to assign individual speaker identities to each utterance in classroom mathematics instruction videos\. The system combines speaker embedding models with gradient boosting classifiers to distinguish between teachers and students, and among individual students within a classroom session\.
### IV-A Speaker Annotation
Two trained annotators \(including the first author\) conducted speaker identification for each utterance in the classroom transcripts\. Annotators worked with multiple sources of contextual information: the mixed\-audio track, the classroom video recording, the seating chart, and the class roster\. These materials were used in combination to maximize the accuracy of speaker attribution\.
For each utterance in the transcript, annotators entered responses into an amended transcript file containing additional columns designed for this task\. The required codes were as follows:
- •transcript\_file\_name: The name of the transcript file being annotated\.
- •turn\_idx: The zero\-indexed row number of the utterance in the raw transcript file, ensuring alignment with the transcript when read programmatically \(e\.g\., in pandas\)\.
- •identified \(0/1\): A binary indicator of whether the speaker could be matched to the teacher or a consenting student in the class\.
- •first\_name / last\_name: If identified, the speaker’s first and last name, exactly matching the entries in the administrative data \(including capitalization and special characters\)\. If the speaker was not identified, these fields were left blank\.
- •unidentified_type: If the speaker could not be matched, annotators classified the voice into one of the following categories:
  - –child: a distinctly child voice not matched to a known student.
  - –adult: a distinctly adult voice not matched to the teacher.
  - –multiple_students: overlapping speech from 2–3 students collapsed into a single utterance.
  - –whole_class: more than 2–3 students speaking simultaneously, collapsed into one utterance.
  - –other: voices not representing classroom participants (e.g., loudspeaker announcements, videos).
  - –impossible: turns that could not be attributed to a single speaker (e.g., [crosstalk], [inaudible], or extremely short utterances).
- •notes: A free\-text field for annotators to record idiosyncrasies, uncertainties, or contextual observations during annotation\.
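A coding scheme like this lends itself to automated consistency checks before agreement statistics are computed. The sketch below validates one annotated row represented as a dictionary; the column names and categories follow the list above, while the function itself and the example rows are hypothetical.

```python
UNIDENTIFIED_TYPES = {"child", "adult", "multiple_students",
                      "whole_class", "other", "impossible"}

def validate_annotation(row):
    """Return a list of problems found in one annotated transcript row."""
    problems = []
    if row.get("identified") not in (0, 1):
        problems.append("identified must be 0 or 1")
    elif row["identified"] == 1:
        # Identified speakers need both names filled in.
        if not row.get("first_name") or not row.get("last_name"):
            problems.append("identified rows need first_name and last_name")
    else:
        # Unidentified speakers leave names blank and carry a category.
        if row.get("first_name") or row.get("last_name"):
            problems.append("unidentified rows must leave names blank")
        if row.get("unidentified_type") not in UNIDENTIFIED_TYPES:
            problems.append("unidentified rows need a valid unidentified_type")
    return problems

# Hypothetical rows, not taken from the dataset.
good_row = {"transcript_file_name": "lesson_01.csv", "turn_idx": 0,
            "identified": 1, "first_name": "Ana", "last_name": "Lopez",
            "unidentified_type": "", "notes": ""}
bad_row = {"transcript_file_name": "lesson_01.csv", "turn_idx": 1,
           "identified": 0, "first_name": "", "last_name": "",
           "unidentified_type": "robot", "notes": ""}
good_problems = validate_annotation(good_row)
bad_problems = validate_annotation(bad_row)
```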
This structured coding scheme ensured consistency across annotators and facilitated downstream quantitative analysis\.
After completing the first round of annotation, the two annotators met to review discrepancies, particularly in classrooms with low inter-rater reliability (IRR) as measured by Cohen’s kappa \[[4](https://arxiv.org/html/2606.13712#bib.bib21)\]. These meetings focused on clarifying ambiguous codes (notably the distinction between multiple_students and whole_class) and resolving systematic sources of disagreement. Following consensus-building discussions, annotators re-annotated the affected lessons to improve reliability and ensure methodological rigor.
Challenges in the speaker annotation process also arose from incomplete video consent in some classes, which meant that only a subset of students were visible in the recordings \(up to over half of the students in some of the courses, see Table[I](https://arxiv.org/html/2606.13712#S2.T1)\)\. As a result, annotators had to rely on audio cues, seating chart positions, and contextual inference rather than direct visual confirmation for some utterances in those courses, particularly when non\-consenting students spoke more often in those sessions\.
Altogether, the annotators agreed on 90.92% of their annotations for the 3,999 utterances and, despite these challenges, were able to confidently identify 66.7% of all utterances with a specific speaker name (2,668 utterances). Overall and individual session agreement levels can be found in Table [II](https://arxiv.org/html/2606.13712#S4.T2); we reached almost perfect agreement on 6 of the sessions and substantial agreement on 2 sessions, leading to almost perfect agreement overall \[[13](https://arxiv.org/html/2606.13712#bib.bib22)\].
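For readers unfamiliar with the statistic, Cohen’s kappa corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the chance agreement implied by each annotator’s label frequencies. A minimal sketch follows; the toy labels are illustrative and are not drawn from the EDSI annotations.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical toy annotations.
annotator_1 = ["Ana", "Ben", "child", "Ana", "teacher", "Ben"]
annotator_2 = ["Ana", "Ben", "Ben", "Ana", "teacher", "child"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 4 of 6 items (p_o = 0.667) while chance agreement is p_e = 0.278, giving a kappa of about 0.54, noticeably lower than the raw agreement rate.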
To finalize the dataset, we conducted a manual review of the remaining disagreements. This review revealed that the discrepancies were primarily between the generic ‘child’ label and a specific student name. We found that the primary annotator provided more specific identifications in these instances compared to the second annotator. To reduce label noise and maximize the quantity of labeled data available for training, we decided to adopt the primary annotator’s specific labels in these conflicting cases. Consequently, after resolving these disagreements and filtering out unidentifiable turns (such as whole_class or impossible), the final ground truth dataset consisted of N = 2,801 utterances.
TABLE II: Cohen’s Kappa by Teacher. Note. 95% confidence interval in parentheses.
### IV-B Feature Engineering
Given the high inter-rater reliability, we determined that the annotations were suitable as ground-truth labels for model training and evaluation, and moved forward with model building using features generated from both the text transcripts and audio data.
We first focused on generating speaker embeddings for identifying speakers. Each classroom session included voice enrollment recordings for all consented participants (teacher and students). These recordings, collected at the beginning of the school year, consisted of 30–60 seconds of speech per speaker reading standardized text. Voice enrollments served as reference exemplars for computing speaker embeddings. Using this audio, we segmented each speaker’s recording into 3-second windows (with 0-second overlap) and computed embeddings for each window, creating multiple reference embeddings per enrolled speaker.
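As an illustration of the windowing step, the sketch below computes the sample boundaries of fixed-length windows over an enrollment clip. The 3-second window and 16 kHz rate follow the text; the function name, the `hop_s` parameter, and the choice to drop a trailing partial window are assumptions made for this example.

```python
def window_bounds(num_samples, sample_rate=16_000, window_s=3.0, hop_s=3.0):
    """Return (start, end) sample indices of fixed-length windows.

    hop_s == window_s gives back-to-back windows (0-second overlap);
    a smaller hop_s makes consecutive windows overlap.
    A trailing partial window is dropped (an assumption of this sketch).
    """
    win = int(round(window_s * sample_rate))
    hop = int(round(hop_s * sample_rate))
    if win <= 0 or hop <= 0:
        raise ValueError("window_s and hop_s must be positive")
    bounds = []
    start = 0
    while start + win <= num_samples:
        bounds.append((start, start + win))
        start += hop
    return bounds

# A hypothetical 10-second enrollment clip at 16 kHz.
clip_bounds = window_bounds(10 * 16_000)
```

Each `(start, end)` pair would then be sliced from the waveform and passed to the embedding model, so a 30–60 second enrollment yields roughly 10 to 20 reference embeddings per speaker under these settings.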
We extracted speaker embeddings using the SpeechBrain ECAPA\-TDNN model \(spkrec\-ecapa\-voxceleb\), a state\-of\-the\-art speaker verification system pre\-trained on VoxCeleb\[[22](https://arxiv.org/html/2606.13712#bib.bib26)\]\. This model produces 192\-dimensional embeddings that capture speaker\-specific acoustic characteristics while being robust to recording conditions and speech content\. All audio was resampled to 16 kHz prior to embedding extraction\.
We formulated our speaker identification task as a binary classification problem: for each utterance and each enrolled speaker \(candidate\), predict whether the candidate is the true speaker\. This formulation generates multiple candidate\-utterance pairs per utterance, with exactly one positive \(correct\) match per utterance\.
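The pairwise formulation can be sketched as a simple expansion of each utterance into one row per enrolled candidate. The function name, field names, and roster below are hypothetical; the real feature rows would also carry the acoustic and transcript features described in this section.

```python
def build_candidate_pairs(utterances, enrolled_speakers):
    """Expand each utterance into one row per enrolled candidate.

    `utterances` is a list of (utterance_id, true_speaker) tuples;
    the label is 1 only for the row whose candidate is the true speaker.
    """
    rows = []
    for utt_id, true_speaker in utterances:
        for candidate in enrolled_speakers:
            rows.append({
                "utterance_id": utt_id,
                "candidate": candidate,
                "label": int(candidate == true_speaker),
            })
    return rows

roster = ["teacher", "student_a", "student_b"]  # hypothetical roster
pairs = build_candidate_pairs([(0, "teacher"), (1, "student_b")], roster)
```

With C enrolled candidates, every utterance contributes one positive and C - 1 negative rows, so the resulting binary problem is imbalanced by construction and becomes more so as class size grows.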
Figure 2: Structured classroom speaker-identification prompt.

    system_prompt: |
      Identify speakers from a classroom transcript.
    prompt: |
      I have a diarized transcript and I want you to identify who is speaking.
      The conversation is from a 4th grade math classroom.
      Provide a list of dictionaries mapping utterance index to speaker index
      for any turns you can confidently identify and an explanation of your
      thought process.
      Only include turns that you can CONFIDENTLY identify as either the teacher
      or a specific student. Each name should exactly match one of the speaker
      indexes provided below.
      Teacher:
      ${teacher_name}
      Students:
      ${student_names}
      Transcript:
      ${transcript}
    parameters:
      - name: teacher_name
        default: |
          - 0: Anne Finch
      - name: student_names
        default: |
          - 1: Anne Sparrow
          - 2: Rob Hummingbird
      - name: transcript
        default: |
          Hi Bob, do you know 3+1?
          Yes Alice, I do it's 4
          Is that right, Mr. Eagle?
Further, we engineered a comprehensive feature set capturing multiple aspects of speaker similarity:
- •Utterance\-to\-Enrollment Similarity: For each candidate speaker, we computed cosine distances between the utterance embedding and all voice enrollment embeddings for that candidate\. We extracted summary statistics including mean, median, 10th percentile, 90th percentile, minimum, and maximum distances\. Lower cosine distances indicate higher similarity\.
- •Speaker Group Features: Utterances were initially grouped by their transcript-based TranscribeMe speaker labels (e.g., “S1”, “S2” for students). For each speaker group, we computed:
  - –Number and proportion of utterances in the group.
  - –Inter-utterance similarity within the group (mean and variance of pairwise cosine similarities).
  - –Similarity between each candidate enrollment and the speaker group’s mean embedding.
  - –Similarity between each candidate enrollment and the longest utterance in the group.
- •Intra\-Group Utterance Distances: For each utterance, we computed distances to other utterances assigned to the same speaker group, capturing consistency of speaker identity within transcript\-labeled clusters\.
- •LLM\-Inferred Speaker: We incorporated a binary feature indicating whether GPT\-5\-mini \(using transcript context\) inferred the current candidate as the utterance’s speaker using the prompt in Figure[2](https://arxiv.org/html/2606.13712#S4.F2)\. For candidates identified by the LLM, we also computed distance statistics to other utterances the LLM assigned to that same speaker across the course\.
- •Utterance Duration: Duration \(in seconds\) of each utterance, as speaker identification accuracy tends to increase with longer utterances\.
All features with missing values \(e\.g\., when LLM inference was unavailable\) were filled with a sentinel value of \-999 to allow the gradient boosting model to learn separate decision rules for missing data\.
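To make the utterance-to-enrollment features and the sentinel convention concrete, the sketch below computes the six summary statistics for one candidate. It uses plain Python lists and toy 2-dimensional vectors in place of the 192-dimensional embeddings; the function names and the linear-interpolation percentile are assumptions of this example, and embeddings are assumed to be non-zero.

```python
import math
import statistics

MISSING = -999.0  # sentinel value described in the text

def cosine_distance(u, v):
    """1 - cosine similarity; lower means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def percentile(sorted_vals, q):
    """Linear-interpolation percentile of a sorted list (q in [0, 100])."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def enrollment_distance_features(utt_emb, enrollment_embs):
    """Summary statistics of distances from one utterance to a candidate's windows."""
    names = ["mean", "median", "p10", "p90", "min", "max"]
    if not enrollment_embs:
        # No enrollment audio for this candidate: mark every feature missing.
        return {name: MISSING for name in names}
    d = sorted(cosine_distance(utt_emb, e) for e in enrollment_embs)
    values = [statistics.fmean(d), statistics.median(d),
              percentile(d, 10), percentile(d, 90), d[0], d[-1]]
    return dict(zip(names, values))

# Toy 2-D vectors stand in for real speaker embeddings.
feats = enrollment_distance_features(
    [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
missing_feats = enrollment_distance_features([1.0, 0.0], [])
```

Because the sentinel lies far outside the valid range of cosine distances (0 to 2), a tree-based model can isolate missing rows with a single split.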
### IV-C Model Architecture and Training
We selected XGBoost, a gradient boosting decision tree algorithm, as our primary classification model due to its strong performance on tabular data with heterogeneous features, natural handling of missing values, and computational efficiency\[[3](https://arxiv.org/html/2606.13712#bib.bib23)\]\. Given our task formulation, the model predicts a probability that each candidate speaker is the true speaker for a given utterance\.
TABLE III: Hyperparameter Search Space

We performed Bayesian hyperparameter optimization using Optuna with 50 trials \[[1](https://arxiv.org/html/2606.13712#bib.bib24)\]. The optimization process used a nested cross-validation strategy:
1. 1\.Outer Loop \(Leave\-One\-Session\-Out\): For each of the eight annotated sessions, we held out that session as a test set and used the remaining seven sessions for training and validation\.
2. 2\.Inner Loop \(Validation Split\): Within the training set, we further separated one additional session as a validation set for early stopping, ensuring the test session remained completely unseen during hyperparameter selection\.
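The two loops above can be sketched as a generator of session-level splits. How the validation session is chosen within each fold is not specified here, so taking the first remaining session is an assumption of this example, as are the session identifiers.

```python
def nested_session_splits(session_ids):
    """Yield (train_sessions, validation_session, test_session) splits.

    Outer loop: each session is held out once as the test set.
    Inner split: one remaining session is set aside for early stopping.
    Picking the first remaining session is an assumption of this sketch.
    """
    for test_session in session_ids:
        remaining = [s for s in session_ids if s != test_session]
        validation_session = remaining[0]
        train_sessions = remaining[1:]
        yield train_sessions, validation_session, test_session

sessions = [f"session_{i}" for i in range(8)]  # hypothetical session IDs
splits = list(nested_session_splits(sessions))
```

With eight sessions this yields eight folds, each training on six sessions, validating on one, and testing on one. Splitting by session rather than by utterance keeps every speaker in the test fold unseen during training.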
As the hyperparameter-search objective, we maximized Top-3 student accuracy (the proportion of student utterances for which the true speaker appeared in the Top-3 predicted candidates) averaged across all folds. The hyperparameter search space is detailed in Table [III](https://arxiv.org/html/2606.13712#S4.T3). The best hyperparameters identified were:
- •n\_estimators = 800
- •learning\_rate = 0\.043
- •max\_depth = 3
- •min\_child\_weight = 8\.13
- •subsample = 0\.87
- •colsample\_bytree = 0\.72
- •gamma = 0\.14
- •reg\_alpha = 0\.086
- •reg\_lambda = 3\.22
- •scale\_pos\_weight = 41\.21
After hyperparameter selection, we trained a final production model on all eight annotated sessions using the optimal hyperparameters\. We reserved 20% of utterances \(stratified by session\) as an evaluation set for early stopping during this final training phase\. Further, XGBoost outputs were calibrated using sigmoid calibration \(Platt scaling\) fitted on leave\-one\-session\-out predictions\[[21](https://arxiv.org/html/2606.13712#bib.bib25)\]\. This ensured predicted probabilities reflected true likelihood of correct speaker assignment, facilitating principled decision\-making for speaker assignment thresholds\.
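Platt scaling fits a two-parameter sigmoid, p = 1 / (1 + exp(-(a * s + b))), that maps raw classifier scores s to calibrated probabilities. The sketch below fits a and b by plain gradient descent on log loss using only the standard library. It omits refinements such as Platt’s smoothed targets, and the function names and toy scores are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, labels, lr=0.1, n_iter=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(a * s + b)
            grad_a += (p - y) * s  # d(log loss)/da for one example
            grad_b += (p - y)      # d(log loss)/db for one example
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrate(score, a, b):
    """Map a raw score to a calibrated probability."""
    return sigmoid(a * score + b)

# Hypothetical held-out scores and whether the candidate was correct.
held_out_scores = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
held_out_labels = [0, 0, 1, 0, 1, 0, 1, 1]
a, b = fit_platt(held_out_scores, held_out_labels)
low, high = calibrate(-2.0, a, b), calibrate(2.0, a, b)
```

Fitting the sigmoid on out-of-fold predictions, as described above, avoids calibrating on scores the classifier produced for its own training data.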
Figure 3: Pipeline for Classroom Session Analysis
System architecture for classroom speaker identification\. The figure depicts the pipeline, which integrates transcript inference from LLMs, diarization labels, utterance audio embeddings, and voice classification using similarities to voice enrollments to produce the final speaker probabilities\.
## V Results
### V-A Model Evaluation
We evaluated model performance using leave\-one\-session\-out cross\-validation across all eight annotated sessions\. For each session, we trained on the remaining seven sessions and predicted on the held\-out session, ensuring completely independent evaluation\.
In evaluating model performance, we employed a set of primary metrics designed to capture both overall and role-specific classification accuracy. *Overall Accuracy* was defined as the proportion of utterances correctly attributed to the true speaker, based on the candidate with the highest predicted probability. To assess role differentiation, *Student vs. Teacher Accuracy* measured the proportion of utterances correctly classified according to speaker role. Within the subset of student utterances, *Student Identification Accuracy* quantified the proportion correctly assigned to the specific student, thereby reflecting the model’s ability to distinguish among multiple student speakers. Finally, *Top-k Accuracy* provided a more flexible measure of predictive reliability by calculating the proportion of utterances in which the true speaker appeared within the top k predicted candidates, with values reported for k = 1, 3, 5, and 10. This suite of metrics offers a comprehensive framework for evaluating both fine-grained speaker identification and broader role-level classification.
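A minimal sketch of the Top-k metric follows. It assumes each utterance comes with a mapping from candidate name to predicted probability; the function names and toy predictions are hypothetical.

```python
def rank_candidates(prob_by_candidate):
    """Order candidate names from most to least probable."""
    return sorted(prob_by_candidate, key=prob_by_candidate.get, reverse=True)

def top_k_accuracy(ranked_candidates, true_speakers, k):
    """Share of utterances whose true speaker is among the k top-ranked candidates."""
    hits = sum(truth in ranked[:k]
               for ranked, truth in zip(ranked_candidates, true_speakers))
    return hits / len(true_speakers)

# Hypothetical per-utterance probabilities for three enrolled speakers.
predictions = [
    {"ana": 0.7, "ben": 0.2, "cy": 0.1},
    {"ana": 0.5, "ben": 0.1, "cy": 0.4},
    {"ana": 0.3, "ben": 0.6, "cy": 0.1},
]
truth = ["ana", "cy", "ana"]
ranked = [rank_candidates(p) for p in predictions]
top1 = top_k_accuracy(ranked, truth, k=1)
top2 = top_k_accuracy(ranked, truth, k=2)
```

In this toy case only the first utterance is correct at k = 1 (accuracy 1/3), while all three true speakers fall within the top two candidates (accuracy 1.0). Overall Accuracy as defined above is the k = 1 case.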
TABLE IV: Identification Accuracy. Note. Speaker identification accuracy of our XGBoost model at different utterance lengths. The total number of utterances (2,801) is lower than the total in the sample (3,999) because we only attempt to identify utterances for which a speaker could be identified.
We also stratified performance by utterance duration, as acoustic speaker identification typically improves with longer audio samples\.
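Stratifying accuracy by duration amounts to binning utterances and averaging correctness within each bin. The bucket edges below (1, 3, 5, and 10 seconds) are assumed for illustration and may not match the bins used in Table IV; the function names and records are hypothetical.

```python
from collections import defaultdict

# Assumed bucket edges in seconds; the paper's exact bins may differ.
BUCKET_EDGES = [1.0, 3.0, 5.0, 10.0]
BUCKET_NAMES = ["0-1s", "1-3s", "3-5s", "5-10s", ">10s"]

def duration_bucket(duration_s):
    """Map a duration in seconds to the name of its bucket."""
    for edge, name in zip(BUCKET_EDGES, BUCKET_NAMES):
        if duration_s <= edge:
            return name
    return BUCKET_NAMES[-1]

def accuracy_by_bucket(records):
    """records: iterable of (duration_s, is_correct) -> {bucket: accuracy}."""
    totals, hits = defaultdict(int), defaultdict(int)
    for duration_s, is_correct in records:
        bucket = duration_bucket(duration_s)
        totals[bucket] += 1
        hits[bucket] += int(is_correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}

# Hypothetical (duration, correct?) records.
results = accuracy_by_bucket(
    [(0.4, False), (0.8, True), (2.5, True), (12.0, True)])
```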
### V-B Baseline Comparison
To quantify the performance gain achieved by our multimodal framework, we compared the final XGBoost classifier against a baseline “Acoustic-Only” approach. The baseline method assigned speaker identity solely based on the lowest cosine distance between the utterance embedding and the voice enrollment samples (ECAPA-TDNN pre-trained on VoxCeleb), without the benefit of transcript-based features or metadata-aware re-ranking.
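The acoustic-only rule reduces to a nearest-neighbor search over enrollment embeddings. The sketch below uses toy 2-dimensional vectors and hypothetical speaker names; embeddings are assumed to be non-zero.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; lower means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def acoustic_only_prediction(utt_emb, enrollments):
    """Pick the speaker whose closest enrollment window is nearest to the utterance.

    `enrollments` maps speaker name -> list of enrollment embeddings.
    """
    best_speaker, best_dist = None, float("inf")
    for speaker, embs in enrollments.items():
        dist = min(cosine_distance(utt_emb, e) for e in embs)
        if dist < best_dist:
            best_speaker, best_dist = speaker, dist
    return best_speaker, best_dist

# Toy vectors stand in for real speaker embeddings.
enrollments = {
    "teacher": [[1.0, 0.0], [0.9, 0.1]],
    "student_a": [[0.0, 1.0], [0.2, 0.8]],
}
speaker, distance = acoustic_only_prediction([0.1, 0.9], enrollments)
```

Unlike the multimodal classifier, this rule has no access to transcript context, diarization groups, or duration, so a single noisy enrollment window can decide the prediction.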
As shown in Table [V](https://arxiv.org/html/2606.13712#Sx1.T5), the multimodal XGBoost classifier yielded meaningful improvements. Notable gains were observed in Student Identification, where the proposed model improved Top-1 accuracy by 11.3 percentage points (from 39.0% to 50.3%) compared to the acoustic baseline.
TABLE V: Top-1 Performance Comparison: Acoustic Baseline vs. Multimodal XGBoost. *Note: pp = percentage points.*
The model showed similar gains in distinguishing between roles, boosting Student vs. Teacher accuracy from 88.0% to a near-perfect 99.3% (+11.3 percentage points). Notably, this performance advantage persisted even in longer utterances (>5s), where the acoustic signal is typically more stable. In these cases, the multimodal approach improved identification accuracy from 64.9% to 76.9%, a gain of 12.0 percentage points. These results suggest that even when acoustic conditions are favorable, the addition of semantic context and metadata provides a necessary lift to resolve ambiguities in identifying child speakers.
### V-C Findings
Our analysis of the 2,801 utterances reveals that multimodal speaker identification performance in classrooms depends heavily on utterance duration and speaker role. As detailed in Table [IV](https://arxiv.org/html/2606.13712#Sx1.T4), the dataset is skewed toward short interactions, with over 58% of utterances lasting 3 seconds or less. Despite this challenging distribution, the model shows a strong ability to distinguish speaker roles and to identify speakers in longer turns.
*Teacher vs. Student Differentiation.* The system achieved near-perfect accuracy in distinguishing between teachers and students, with an overall accuracy of 99.3%. Performance in this category was robust even for the shortest utterances (0-1s), achieving 98.0% accuracy, and reached 100% accuracy for all utterances longer than 5 seconds.
*Student Identification.* Identifying individual students proved more challenging than role detection, particularly for short utterances. The overall Top-1 accuracy for specific student identification was 50.3%. However, this metric is clearly impacted by the brevity of classroom dialogue.
- •Impact of Duration: Identification accuracy increases with utterance length. For substantive turns (longer than 10 seconds), the model achieves a 79.2% Top-1 accuracy and a 95.8% Top-3 accuracy. Conversely, short utterances (0-1s) showed a Top-1 accuracy of only 41.0%.
- •Top-k Performance: The model proves highly effective as a candidate filtering system. While Top-1 accuracy for utterances under 3 seconds hovered between 41% and 55%, the Top-5 accuracy for these same segments rose to 71.8-82.1%. Across all durations, the correct speaker was found within the top 10 candidates 87.0% of the time, and 100% of the time for turns over 10 seconds.
These results indicate that while short, context\-poor utterances remain an adversarial frontier, the multimodal architecture effectively isolates the correct speaker for the substantive contributions necessary for analyzing student reasoning and discourse\.
## VI Discussion
Critically, we find that the model performs best where it matters most for educational feedback. The near-perfect discrimination between teacher and student roles (99.3%) immediately enables the reliable automated evaluation of teacher-only metrics, such as “Uptake” \[[7](https://arxiv.org/html/2606.13712#bib.bib3)\] or “Questioning” \[[6](https://arxiv.org/html/2606.13712#bib.bib2)\]. Furthermore, the model’s performance scales positively with the semantic richness of the turn. For utterances exceeding 10 seconds (segments that typically represent students articulating mathematical reasoning or complex ideas), the model achieves a Top-3 accuracy of 95.8%. This indicates that while the system may struggle with fleeting backchannels, it successfully recovers the “high-value” contributions necessary for tracking individual student mastery and engagement.
### VI-A The Teacher-Student Performance Gap
Our results indicate a distinct performance gap between teacher and student identification, with teacher identification achieving near\-perfect accuracy even at shorter durations\. This disparity is likely driven by two convergent factors: data abundance and acoustic model bias\.
First, the classroom environment is inherently imbalanced\. As shown in Table[I](https://arxiv.org/html/2606.13712#S2.T1), teachers consistently dominate the acoustic space, with average turn lengths \(4\.88s–18\.8s\) overwhelmingly exceeding those of students \(0\.75s–3\.11s\)\. This provides the model with substantially more acoustic information for teachers, resulting in more robust centroids for the target speaker\.
Second, the acoustic embedding model used \(ECAPA\-TDNN\) was pre\-trained on VoxCeleb, a dataset composed almost exclusively of adult speech\. As noted in the introduction, children’s speech exhibits higher spectral variability and higher fundamental frequencies than adult speech\. Consequently, the pre\-trained feature extractor is likely operating in\-domain for teachers \(adults\) but out\-of\-domain for students \(children\), leading to the observed degradation in student\-specific accuracy\. Our baseline comparison confirms that while acoustic embeddings alone \(pre\-trained on VoxCeleb\) are insufficient for distinguishing between children \(39% accuracy\), they provide a necessary foundation that, when combined with semantic features, allows the classifier to reach 71% accuracy\. We attempted to address this issue by using versions of the model fine\-tuned on children’s voice data, following Tabatabaee et al\.\[[25](https://arxiv.org/html/2606.13712#bib.bib17)\], but found that the model performed no better \(and in a few instances worse\) for these classrooms\.
### IV\-B The “Short Utterance” Frontier
The drop in accuracy for utterances under 3 seconds \(Top\-1 41\.0%–55\.0%\) highlights the physical limitations of acoustic biometrics\. Speaker embeddings require sufficient phonetic variation to form a unique signature\. Very short student utterances in K\-12 classrooms often consist of “backchanneling” \(e\.g\., “Yeah”, “Okay”\) or rapid peer\-to\-peer overlapping speech\. Both utterance length and high environmental noise affect x\-vector\-based speaker identification models, with duration affecting model training more than testing\[[16](https://arxiv.org/html/2606.13712#bib.bib29),[11](https://arxiv.org/html/2606.13712#bib.bib28)\]\. Magrin\-Chagnolleau et al\.\[[16](https://arxiv.org/html/2606.13712#bib.bib29)\] show that the biggest jump in model performance occurs between utterance lengths of 1 and 2 seconds\. Our findings are consistent with this: across all Top\-k experiments, the largest jump in performance occurs between utterance durations of 0\-1 second and 1\-3 seconds \(see Table [IV](https://arxiv.org/html/2606.13712#Sx1.T4)\)\. The shortest segments lack the temporal duration required to extract stable x\-vectors or ECAPA\-TDNN embeddings\.
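A duration breakdown of this kind can be produced by grouping predictions into duration bins and computing accuracy per bin\. The sketch below assumes bin edges at 1, 3, 5, and 10 seconds, chosen to match the thresholds mentioned in the text; the exact bins behind Table IV, the function names, and the toy records are assumptions\.

```python
from bisect import bisect_right
from collections import defaultdict

# Assumed bin edges in seconds; the paper's exact binning may differ.
BIN_EDGES = [1.0, 3.0, 5.0, 10.0]
BIN_LABELS = ["0-1s", "1-3s", "3-5s", "5-10s", ">10s"]


def duration_bin(seconds):
    """Map a duration to its bin label; a 1.0 s utterance falls in '1-3s'."""
    return BIN_LABELS[bisect_right(BIN_EDGES, seconds)]


def accuracy_by_duration(records):
    """records: iterable of (duration_seconds, is_correct) pairs.

    Returns accuracy per bin, omitting bins that contain no utterances.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for duration, is_correct in records:
        label = duration_bin(duration)
        totals[label] += 1
        correct[label] += int(is_correct)
    return {
        label: correct[label] / totals[label]
        for label in BIN_LABELS
        if totals[label]
    }


# Toy predictions (hypothetical, not the study's data).
records = [
    (0.4, False), (0.8, True), (0.9, False), (0.6, False),
    (2.0, True), (2.5, False),
    (7.0, True),
    (12.0, True),
]
by_bin = accuracy_by_duration(records)
```

Reporting the bin counts alongside each accuracy is advisable in practice, since the longest bins often contain few utterances\.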
However, the high Top\-5 accuracy \(71\.8%–82\.1%\) for these short turns suggests that while the model cannot pinpoint the exact student with high confidence, it successfully narrows the search space\. This validates the multimodal approach: the acoustic model filters the candidates, potentially allowing the LLM’s semantic context to make the final disambiguation in instances where transcripts capture specific addressee cues \(e\.g\., “Good job, Michael”\)\. This narrowed search space also vastly reduces the human effort needed to generate high\-quality ground truth labels in future work\.
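The filter\-then\-disambiguate idea can be sketched as a two\-stage procedure: keep the acoustic Top\-k shortlist, then favor a shortlisted student whose name appears in the adjacent teacher turn\. This is only an illustration of the narrowing logic, not the pipeline used in the study, which feeds LLM\-derived features into a gradient boosting classifier\. The roster, the fixed `bonus` weight, and the simple name\-matching rule are all hypothetical\.

```python
import re


def rerank_with_addressee_cue(acoustic_scores, next_teacher_turn, roster, k=5, bonus=0.5):
    """Pick a speaker from the acoustic top-k, boosted by a name mention.

    acoustic_scores: dict of speaker ID -> acoustic score (higher = more likely).
    next_teacher_turn: transcript text of the teacher turn that follows.
    roster: dict of speaker ID -> first name (hypothetical lookup table).
    """
    # Stage 1: the acoustic model narrows the search space to k candidates.
    shortlist = sorted(acoustic_scores, key=acoustic_scores.get, reverse=True)[:k]

    # Stage 2: a crude stand-in for semantic context, matching names as whole words.
    words = set(re.findall(r"[a-z']+", next_teacher_turn.lower()))

    def combined(speaker_id):
        named = roster[speaker_id].lower() in words
        return acoustic_scores[speaker_id] + (bonus if named else 0.0)

    return max(shortlist, key=combined)


# Toy example (hypothetical scores and names).
acoustic_scores = {"s1": 0.30, "s2": 0.28, "s3": 0.22, "s4": 0.15, "s5": 0.05}
roster = {"s1": "Ava", "s2": "Michael", "s3": "Noor", "s4": "Luis", "s5": "Kim"}
with_cue = rerank_with_addressee_cue(acoustic_scores, "Good job, Michael", roster)
without_cue = rerank_with_addressee_cue(acoustic_scores, "Good job, everyone", roster)
```

Because the name cue is applied only within the shortlist, a student the acoustic model has ruled out cannot be selected on the basis of a name mention alone\.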
## Conclusion
This study evaluates a multimodal framework for speaker identification in the “perfect storm” of K\-12 mathematics classrooms—an environment characterized by non\-stationary noise and highly variable child speech\. By integrating ECAPA\-TDNN acoustic embeddings with LLM\-inferred semantic anchors, we demonstrate a robust method for attributing classroom discourse\.
Our findings show that this approach solves the critical task of differentiating teacher from student speech with over 99% accuracy, a foundational requirement for automated instructional feedback systems\. While identifying specific students remains challenging in short, sub\-second utterances, the model achieves high reliability \(95\.8% Top\-3 accuracy\) for longer, substantively rich turns\. This suggests that the “linguistic signature” provided by transcript context significantly complements acoustic biometrics when sufficient semantic information is present\.
These results advance the feasibility of scalable, equitable classroom analytics\. By accurately tracking “who said what”, such systems can move beyond aggregate class metrics to provide granular insights into individual student participation and teacher\-student interaction patterns\. Future work should focus on improving resolution for short utterances to capture the full spectrum of rapid\-fire classroom discourse\.
## Acknowledgment
We thank the Gates Foundation, the Walton Family Foundation, and the Chan Zuckerberg Initiative for their generous financial support\.
We also gratefully acknowledge the contributions of Talya Lebson, who provided careful and consistent annotation of the speaker identification dataset\. Their diligence in applying the coding framework and attention to detail were essential to ensuring the reliability and validity of the data used in this study\.
## References
- \[1\]\(2019\-07\)Optuna: A Next\-generation Hyperparameter Optimization Framework\.InProceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining,Anchorage AK USA,pp\. 2623–2631\(en\)\.External Links:ISBN 978\-1\-4503\-6201\-6,[Link](https://dl.acm.org/doi/10.1145/3292500.3330701),[Document](https://dx.doi.org/10.1145/3292500.3330701)Cited by:[§IV\-C](https://arxiv.org/html/2606.13712#S4.SS3.p2.1)\.
- \[2\]Benchmark Data Collection \| Center for Educational Data Science and Innovation\.External Links:[Link](https://edsi.umd.edu/projects/benchmark-data-collection)Cited by:[§III](https://arxiv.org/html/2606.13712#S3.p1.1)\.
- \[3\]T\. Chen and C\. Guestrin\(2016\-08\)XGBoost: A Scalable Tree Boosting System\.InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,pp\. 785–794\.Note:arXiv:1603\.02754 \[cs\]External Links:[Link](http://arxiv.org/abs/1603.02754),[Document](https://dx.doi.org/10.1145/2939672.2939785)Cited by:[§IV\-C](https://arxiv.org/html/2606.13712#S4.SS3.p1.1)\.
- \[4\]J\. Cohen\(1960\)A Coefficient of Agreement for Nominal Scales\.Educational and Psychological Measurement20\(1\),pp\. 37–46\.Note:\_eprint: https://doi\.org/10\.1177/001316446002000104External Links:[Link](https://doi.org/10.1177/001316446002000104),[Document](https://dx.doi.org/10.1177/001316446002000104)Cited by:[§IV\-A](https://arxiv.org/html/2606.13712#S4.SS1.p5.1)\.
- \[5\]D\. Demszky, J\. Liu, H\. C\. Hill, D\. Jurafsky, and C\. Piech\(2024\-09\)Can Automated Feedback Improve Teachers’ Uptake of Student Ideas? Evidence From a Randomized Controlled Trial in a Large\-Scale Online Course\.Educational Evaluation and Policy Analysis46\(3\),pp\. 483–505\(EN\)\.Note:Publisher: American Educational Research AssociationExternal Links:ISSN 0162\-3737,[Link](https://doi.org/10.3102/01623737231169270),[Document](https://dx.doi.org/10.3102/01623737231169270)Cited by:[§I](https://arxiv.org/html/2606.13712#S1.p1.1),[§II](https://arxiv.org/html/2606.13712#S2.p5.1)\.
- \[6\]D\. Demszky, J\. Liu, H\. C\. Hill, S\. Sanghi, and A\. Chung\(2025\-04\)Automated feedback improves teachers’ questioning quality in brick\-and\-mortar classrooms: Opportunities for further enhancement\.Computers & Education227,pp\. 105183\(en\)\.External Links:ISSN 03601315,[Link](https://linkinghub.elsevier.com/retrieve/pii/S0360131524001970),[Document](https://dx.doi.org/10.1016/j.compedu.2024.105183)Cited by:[§I](https://arxiv.org/html/2606.13712#S1.p1.1),[§II](https://arxiv.org/html/2606.13712#S2.p5.1),[Discussion](https://arxiv.org/html/2606.13712#Sx2.p1.1)\.
- \[7\]D\. Demszky, J\. Liu, Z\. Mancenido, J\. Cohen, H\. Hill, D\. Jurafsky, and T\. Hashimoto\(2021\-08\)Measuring Conversational Uptake: A Case Study on Student\-Teacher Interactions\.InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\),C\. Zong, F\. Xia, W\. Li, and R\. Navigli \(Eds\.\),Online,pp\. 1638–1653\.External Links:[Link](https://aclanthology.org/2021.acl-long.130/),[Document](https://dx.doi.org/10.18653/v1/2021.acl-long.130)Cited by:[§I](https://arxiv.org/html/2606.13712#S1.p1.1),[§II](https://arxiv.org/html/2606.13712#S2.p5.1),[Discussion](https://arxiv.org/html/2606.13712#Sx2.p1.1)\.
- \[8\]D\. Demszky, J\. Liu, H\. C\. Hill, D\. Jurafsky, and C\. Piech\(2021\-01\)Can Automated Feedback Improve Teachers’ Uptake of Student Ideas? Evidence From a Randomized Controlled Trial In a Large\-Scale Online Course\.Educational Evaluation and Policy Analysis\.Note:MAG ID: 3208686870Cited by:[§II](https://arxiv.org/html/2606.13712#S2.p5.1)\.
- \[9\]D\. Demszky and J\. Liu\(2023\-07\)M\-Powering Teachers: Natural Language Processing Powered Feedback Improves 1:1 Instruction and Student Outcomes\.Proceedings of the Tenth ACM Conference on Learning @ Scale,pp\. 59–69\.Note:MAG ID: 4384661557External Links:[Document](https://dx.doi.org/10.1145/3573051.3593379)Cited by:[§I](https://arxiv.org/html/2606.13712#S1.p1.1)\.
- \[10\]G\. Efstathiadis, V\. Yadav, and A\. Abbas\(2025\-05\)LLM\-based speaker diarization correction: A generalizable approach\.Speech Communication170,pp\. 103224\.Note:arXiv:2406\.04927 \[eess\]External Links:ISSN 01676393,[Link](http://arxiv.org/abs/2406.04927),[Document](https://dx.doi.org/10.1016/j.specom.2025.103224)Cited by:[§I](https://arxiv.org/html/2606.13712#S1.p3.1),[§II](https://arxiv.org/html/2606.13712#S2.p2.1)\.
- \[11\]A\. Gusev, V\. Volokhov, T\. Andzhukaev, S\. Novoselov, G\. Lavrentyeva, M\. Volkova, A\. Gazizullina, A\. Shulipa, A\. Gorlanov, A\. Avdeeva,et al\.\(2020\)Deep speaker embeddings for far\-field speaker recognition on short utterances\.arXiv preprint arXiv:2002\.06033\.Cited by:[§IV\-B](https://arxiv.org/html/2606.13712#Sx2.SS2.p1.1)\.
- \[12\]N\. Kanda, X\. Xiao, Y\. Gaur, X\. Wang, Z\. Meng, Z\. Chen, and T\. Yoshioka\(2022\-01\)Transcribe\-to\-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End\-to\-End Speaker\-Attributed ASR\.arXiv\.Note:arXiv:2110\.03151 \[eess\]External Links:[Link](http://arxiv.org/abs/2110.03151),[Document](https://dx.doi.org/10.48550/arXiv.2110.03151)Cited by:[§I](https://arxiv.org/html/2606.13712#S1.p1.1)\.
- \[13\]J\. R\. Landis and G\. G\. Koch\(1977\-03\)The measurement of observer agreement for categorical data\.Biometrics33\(1\),pp\. 159–174\(eng\)\.External Links:ISSN 0006\-341XCited by:[§IV\-A](https://arxiv.org/html/2606.13712#S4.SS1.p7.1)\.
- \[14\]E\. Loweimi, M\. Qian, K\. Knill, and M\. Gales\(2024\-09\)On the Usefulness of Speaker Embeddings for Speaker Retrieval in the Wild: A Comparative Study of x\-vector and ECAPA\-TDNN Models\.InInterspeech 2024,pp\. 3774–3778\(en\)\.External Links:[Link](https://www.isca-archive.org/interspeech_2024/loweimi24_interspeech.html),[Document](https://dx.doi.org/10.21437/Interspeech.2024-161)Cited by:[§II](https://arxiv.org/html/2606.13712#S2.p1.1)\.
- \[15\]Y\. Ma, J\. B\. Wiggins, M\. Celepkolu, K\. E\. Boyer, C\. Lynch, and E\. Wiebe\(2021\)The Challenge of Noisy Classrooms: Speaker Detection During Elementary Students’ Collaborative Dialogue\.InArtificial Intelligence in Education,I\. Roll, D\. McNamara, S\. Sosnovsky, R\. Luckin, and V\. Dimitrova \(Eds\.\),Cham,pp\. 268–281\(en\)\.External Links:ISBN 978\-3\-030\-78292\-4,[Document](https://dx.doi.org/10.1007/978-3-030-78292-4%5F22)Cited by:[§I](https://arxiv.org/html/2606.13712#S1.p2.1)\.
- \[16\]I\. Magrin\-Chagnolleau, J\. Bonastre, and F\. Bimbot\(2024\)Effect of utterance duration and phonetic content on speaker identification using second\-order statistical methods\.External Links:2402\.16429,[Link](https://arxiv.org/abs/2402.16429)Cited by:[§IV\-B](https://arxiv.org/html/2606.13712#Sx2.SS2.p1.1)\.
- \[17\]J\. Malamut, D\. Demszky, C\. Bywater, M\. Reinhart, and H\. C\. Hill\(2025\-09\)Facilitating Evidence\-Based Instructional Coaching With Automated Feedback on Teacher Discourse\.\(en\)\.Note:EdWorkingPaper: 25\-1349External Links:[Link](https://edworkingpapers.com/ai25-1298),[Document](https://dx.doi.org/10.26300/XX9Z-8F27)Cited by:[§II](https://arxiv.org/html/2606.13712#S2.p5.1)\.
- \[18\]T\. Meizlish and C\. Ziffo\(2025\-09\)Evaluating an LLM’s Performance in Annotating Discourse Strategies\.In Review\(en\)\.External Links:[Link](https://www.researchsquare.com/article/rs-7246876/v1),[Document](https://dx.doi.org/10.21203/rs.3.rs-7246876/v1)Cited by:[§I](https://arxiv.org/html/2606.13712#S1.p3.1),[§II](https://arxiv.org/html/2606.13712#S2.p5.1)\.
- \[19\]M\. Nguyen, F\. Dernoncourt, S\. Yoon, H\. Deilamsalehy, H\. Tan, R\. Rossi, Q\. H\. Tran, T\. Bui, and T\. H\. Nguyen\(2024\-07\)Identifying Speakers in Dialogue Transcripts: A Text\-based Approach Using Pretrained Language Models\.arXiv\.Note:arXiv:2407\.12094 \[cs\] version: 1External Links:[Link](http://arxiv.org/abs/2407.12094),[Document](https://dx.doi.org/10.48550/arXiv.2407.12094)Cited by:[§II](https://arxiv.org/html/2606.13712#S2.p3.1)\.
- \[20\]M\. Perez, B\. Coker, K\. B\. Kocabagli, J\. Vitale, and A\. Van Camp\(2025\)Multimodal Classroom Diarization with GPT Re\-scoring: Teacher or Student?\.InArtificial Intelligence in Education,A\. I\. Cristea, E\. Walker, Y\. Lu, O\. C\. Santos, and S\. Isotani \(Eds\.\),Cham,pp\. 228–235\(en\)\.External Links:ISBN 978\-3\-031\-98462\-4,[Document](https://dx.doi.org/10.1007/978-3-031-98462-4%5F29)Cited by:[§II](https://arxiv.org/html/2606.13712#S2.p4.1)\.
- \[21\]J\. Platt\(2000\-06\)Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods\.Adv\. Large Margin Classif\.10\.Cited by:[§IV\-C](https://arxiv.org/html/2606.13712#S4.SS3.p5.1)\.
- \[22\]M\. Ravanelli, T\. Parcollet, P\. Plantinga, A\. Rouhe, S\. Cornell, L\. Lugosch, C\. Subakan, N\. Dawalatabad, A\. Heba, J\. Zhong, J\. Chou, S\. Yeh, S\. Fu, C\. Liao, E\. Rastorgueva, F\. Grondin, W\. Aris, H\. Na, Y\. Gao, R\. D\. Mori, and Y\. Bengio\(2021\-06\)SpeechBrain: A General\-Purpose Speech Toolkit\.arXiv\.Note:arXiv:2106\.04624 \[eess\]External Links:[Link](http://arxiv.org/abs/2106.04624),[Document](https://dx.doi.org/10.48550/arXiv.2106.04624)Cited by:[§IV\-B](https://arxiv.org/html/2606.13712#S4.SS2.p3.1)\.
- \[23\]R\. Southwell, S\. Pugh, E\. Margaret Perkoff, C\. Clevenger, J\. Bush, R\. Lieber, W\. Ward, P\. Foltz, and S\. D’Mello\(2022\-07\)Challenges and Feasibility of Automatic Speech Recognition for Modeling Student Collaborative Discourse in Classrooms\.International Educational Data Mining Society\(en\)\.External Links:[Document](https://dx.doi.org/10.5281/ZENODO.6853109)Cited by:[§I](https://arxiv.org/html/2606.13712#S1.p2.1)\.
- \[24\]Swivl\.\(en\-US\)\.External Links:[Link](https://www.swivl.com/about-us/)Cited by:[§III](https://arxiv.org/html/2606.13712#S3.p3.1)\.
- \[25\]S\. Tabatabaee, J\. Liu, and C\. Espy\-Wilson\(2025\-08\)FT\-Boosted SV: Towards Noise Robust Speaker Verification for English Speaking Classroom Environments\.InInterspeech 2025,pp\. 2815–2819\(en\)\.External Links:[Link](https://www.isca-archive.org/interspeech_2025/tabatabaee25_interspeech.html),[Document](https://dx.doi.org/10.21437/Interspeech.2025-1002)Cited by:[§I](https://arxiv.org/html/2606.13712#S1.p2.1),[§IV\-A](https://arxiv.org/html/2606.13712#Sx2.SS1.p3.1)\.
- \[26\]T\. Thebaud, Y\. Lu, M\. Wiesner, P\. Viechnicki, and N\. Dehak\(2025\-09\)Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM\.arXiv\.Note:arXiv:2508\.04795 \[cs\] version: 2External Links:[Link](http://arxiv.org/abs/2508.04795),[Document](https://dx.doi.org/10.48550/arXiv.2508.04795)Cited by:[§I](https://arxiv.org/html/2606.13712#S1.p3.1),[§II](https://arxiv.org/html/2606.13712#S2.p2.1)\.
- \[27\]TranscribeMe\.\(en\)\.External Links:[Link](https://www.transcribeme.com/audio-transcription-services/)Cited by:[§III](https://arxiv.org/html/2606.13712#S3.p3.1)\.
- \[28\]P\. Xu, J\. Liu, N\. Jones, J\. Cohen, and W\. Ai\(2024\-06\)The Promises and Pitfalls of Using Language Models to Measure Instruction Quality in Education\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),K\. Duh, H\. Gomez, and S\. Bethard \(Eds\.\),Mexico City, Mexico,pp\. 4375–4389\.External Links:[Link](https://aclanthology.org/2024.naacl-long.246/),[Document](https://dx.doi.org/10.18653/v1/2024.naacl-long.246)Cited by:[§I](https://arxiv.org/html/2606.13712#S1.p1.1),[§I](https://arxiv.org/html/2606.13712#S1.p2.1)\.