Tag
This study investigates whether instruction-tuned LLMs (Llama-3.1-8B, Qwen2.5-7B, Mistral-7B, Phi-3-mini) can reliably classify Correct Information Units (CIUs) in aphasic discourse transcripts. Few-shot prompting yields competitive F1 scores (0.776–0.817) for three of the four models, but performance varies by aphasia severity, and agreement with human raters remains insufficient for fully autonomous use.
This paper critiques the use of single-reference ground truth in ASR evaluation, arguing it causes epistemic injustice for speakers with aphasia. It proposes a new metric, Epistemic Injustice Distance, and advocates for WER-Range to account for diverse transcription conventions.