Do LLMs Reliably Identify Correct Information Units in Aphasic Discourse?

arXiv cs.AI 06/16/26, 04:00 AM Papers

llm aphasia discourse-analysis few-shot natural-language-processing clinical-nlp evaluation

Summary

This study investigates whether instruction-tuned LLMs (Llama-3.1-8B, Qwen2.5-7B, Mistral-7B, Phi-3-mini) can reliably classify Correct Information Units (CIUs) at the token level in aphasic discourse transcripts. Few-shot prompting yields competitive mean F1 scores (0.776–0.817) for three of the four models, but performance is weakest in more severe aphasia, and agreement with human annotation remains insufficient for fully autonomous use.

arXiv:2606.15696v1 Announce Type: new Abstract: Correct Information Units (CIUs) are central to discourse assessment in aphasia because they quantify communicative informativeness rather than linguistic form alone. However, CIU scoring is time intensive and requires trained raters. This study examined whether instruction-tuned large language models (LLMs) can reliably perform token-level CIU classification from aphasic discourse transcripts. Sixteen picture-description transcripts elicited with the Cat Rescue stimulus were annotated for CIU status according to Nicholas and Brookshire (1993). The sample spanned four severity strata: control, mild, moderate, and severe aphasia. Four publicly available instruction-tuned LLMs were benchmarked under zero-shot and two few-shot prompting conditions across five stratified random seeds. Performance was evaluated against consensus human labels using accuracy, precision, recall, F1, and Cohen's kappa. Zero-shot prompting was insufficient across models. In contrast, few-shot prompting yielded substantial gains and produced competitive performance for three viable models. Mean few-shot F1 scores ranged from 0.776 to 0.817 across Llama-3.1-8B, Qwen2.5-7B, and Mistral-7B, with no significant differences between fixed global and per-chunk local example selection. Phi-3-mini was unstable and did not yield reliable performance. Viable models showed high recall but lower precision, suggesting systematic over-classification of tokens as CIUs. Performance also varied by discourse severity, with the weakest results in more severe aphasia. Few-shot LLM prompting can support automated CIU identification without gradient-based task training, but agreement with human annotation remains insufficient for fully autonomous use. These findings support LLM-based CIU scoring as a promising human-in-the-loop component of discourse assessment systems.

Cached at: 06/16/26, 11:48 AM

# Do LLMs Reliably Identify Correct Information Units in Aphasic Discourse?
Source: [https://arxiv.org/abs/2606.15696](https://arxiv.org/abs/2606.15696)
[View PDF](https://arxiv.org/pdf/2606.15696)

> Abstract: Correct Information Units (CIUs) are central to discourse assessment in aphasia because they quantify communicative informativeness rather than linguistic form alone. However, CIU scoring is time intensive and requires trained raters. This study examined whether instruction-tuned large language models (LLMs) can reliably perform token-level CIU classification from aphasic discourse transcripts. Sixteen picture-description transcripts elicited with the Cat Rescue stimulus were annotated for CIU status according to Nicholas and Brookshire (1993). The sample spanned four severity strata: control, mild, moderate, and severe aphasia. Four publicly available instruction-tuned LLMs were benchmarked under zero-shot and two few-shot prompting conditions across five stratified random seeds. Performance was evaluated against consensus human labels using accuracy, precision, recall, F1, and Cohen's kappa. Zero-shot prompting was insufficient across models. In contrast, few-shot prompting yielded substantial gains and produced competitive performance for three viable models. Mean few-shot F1 scores ranged from 0.776 to 0.817 across Llama-3.1-8B, Qwen2.5-7B, and Mistral-7B, with no significant differences between fixed global and per-chunk local example selection. Phi-3-mini was unstable and did not yield reliable performance. Viable models showed high recall but lower precision, suggesting systematic over-classification of tokens as CIUs. Performance also varied by discourse severity, with the weakest results in more severe aphasia. Few-shot LLM prompting can support automated CIU identification without gradient-based task training, but agreement with human annotation remains insufficient for fully autonomous use. These findings support LLM-based CIU scoring as a promising human-in-the-loop component of discourse assessment systems.
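
The abstract evaluates token-level predictions against consensus human labels using accuracy, precision, recall, F1, and Cohen's kappa. As a rough illustration of how those quantities relate, here is a minimal sketch in plain Python. It is not the authors' evaluation code: the function name `binary_metrics`, the 1/0 label encoding (1 = CIU, 0 = non-CIU), and the toy label sequences are all assumptions made for this example.

```python
from collections import Counter


def binary_metrics(gold, pred):
    """Token-level metrics for binary labels (assumed: 1 = CIU, 0 = non-CIU)."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")

    n = len(gold)
    counts = Counter(zip(gold, pred))
    tp, fp = counts[(1, 1)], counts[(0, 1)]
    fn, tn = counts[(1, 0)], counts[(0, 0)]

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Cohen's kappa: observed agreement corrected for the agreement
    # expected by chance from the two label distributions.
    p_observed = accuracy
    p_expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = ((p_observed - p_expected) / (1 - p_expected)
             if p_expected != 1 else 0.0)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


# Toy data (hypothetical): the "model" marks every true CIU but also
# labels two non-CIU tokens as CIUs, i.e. it over-classifies.
gold = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
pred = [1, 1, 1, 1, 0, 1, 1, 0, 1, 0]

for name, value in binary_metrics(gold, pred).items():
    print(f"{name}: {value:.3f}")
```

In this toy case recall is perfect while precision drops, which is the same qualitative pattern the abstract describes for the viable models. Kappa comes out lower than F1 because it discounts agreement that would occur by chance, one reason a respectable F1 can coexist with agreement that is still too low for unsupervised scoring.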

## Submission history

From: Jason Pittman [[view email](https://arxiv.org/show-email/a226f98c/2606.15696)]
**[v1]** Fri, 10 Apr 2026 01:53:35 UTC (686 KB)

Similar Articles

Are you speaking my languages? On spoken language adherence in multimodal LLMs

Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment

Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings

Don't let the LLM speak, just probe it (8 minute read)

LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment