Tag
This paper evaluates a multimodal framework for speaker identification in K-12 classrooms that combines acoustic embeddings (ECAPA-TDNN) with LLM-derived semantic context from transcripts. The combination raises identification accuracy from 39% to 50.3% overall and from 64.9% to 76.9% on longer utterances.
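
The summary does not say how the acoustic and semantic signals are combined, so the following is only a minimal sketch of one common option, score-level (late) fusion. It assumes each enrolled speaker has a centroid embedding and that the LLM returns a probability per candidate speaker for each utterance. The function names, the `weight` and `temperature` parameters, and the toy vectors are all hypothetical illustrations, not values or methods from the paper.

```python
import numpy as np


def cosine_scores(utterance_emb, speaker_centroids):
    """Cosine similarity between one utterance embedding and each speaker centroid."""
    u = utterance_emb / np.linalg.norm(utterance_emb)
    c = speaker_centroids / np.linalg.norm(speaker_centroids, axis=1, keepdims=True)
    return c @ u


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def fuse(acoustic_scores, semantic_probs, weight=0.5, temperature=0.1):
    """Weighted log-linear fusion of acoustic and semantic evidence.

    weight and temperature are assumed tuning knobs, not values from the paper.
    """
    acoustic_probs = softmax(np.asarray(acoustic_scores) / temperature)
    eps = 1e-12  # avoids log(0) when a candidate gets zero probability
    log_fused = (weight * np.log(acoustic_probs + eps)
                 + (1.0 - weight) * np.log(np.asarray(semantic_probs) + eps))
    return softmax(log_fused)


# Toy data: three enrolled speakers in a 2-D embedding space
# (real ECAPA-TDNN embeddings have far more dimensions).
centroids = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
utterance = np.array([1.0, 0.9])  # acoustically ambiguous between speakers 0 and 1

acoustic = cosine_scores(utterance, centroids)
semantic = np.array([0.2, 0.7, 0.1])  # stand-in for LLM-derived speaker probabilities
fused = fuse(acoustic, semantic)

acoustic_pick = int(np.argmax(acoustic))
fused_pick = int(np.argmax(fused))
```

In this toy case the acoustic scores alone slightly favor speaker 0, while the semantic prior shifts the fused decision to speaker 1. That is the kind of correction transcript context could provide when the audio evidence is weak, for example on short utterances.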