Do LLMs Reliably Identify Correct Information Units in Aphasic Discourse?

arXiv cs.AI Papers

Summary

This study investigates whether instruction-tuned LLMs (Llama-3.1-8B, Qwen2.5-7B, Mistral-7B, Phi-3-mini) can reliably classify Correct Information Units (CIUs) at the token level in aphasic discourse transcripts. Few-shot prompting yields competitive mean F1 scores (0.776–0.817) for three of the four models (Phi-3-mini was unstable), but performance varies by aphasia severity and agreement with human annotation remains insufficient for fully autonomous use.

arXiv:2606.15696v1. Abstract: Correct Information Units (CIUs) are central to discourse assessment in aphasia because they quantify communicative informativeness rather than linguistic form alone. However, CIU scoring is time intensive and requires trained raters. This study examined whether instruction-tuned large language models (LLMs) can reliably perform token-level CIU classification from aphasic discourse transcripts. Sixteen picture-description transcripts elicited with the Cat Rescue stimulus were annotated for CIU status according to Nicholas and Brookshire (1993). The sample spanned four severity strata: control, mild, moderate, and severe aphasia. Four publicly available instruction-tuned LLMs were benchmarked under zero-shot and two few-shot prompting conditions across five stratified random seeds. Performance was evaluated against consensus human labels using accuracy, precision, recall, F1, and Cohen's kappa. Zero-shot prompting was insufficient across models. In contrast, few-shot prompting yielded substantial gains and produced competitive performance for three viable models. Mean few-shot F1 scores ranged from 0.776 to 0.817 across Llama-3.1-8B, Qwen2.5-7B, and Mistral-7B, with no significant differences between fixed global and per-chunk local example selection. Phi-3-mini was unstable and did not yield reliable performance. Viable models showed high recall but lower precision, suggesting systematic over-classification of tokens as CIUs. Performance also varied by discourse severity, with the weakest results in more severe aphasia. Few-shot LLM prompting can support automated CIU identification without gradient-based task training, but agreement with human annotation remains insufficient for fully autonomous use. These findings support LLM-based CIU scoring as a promising human-in-the-loop component of discourse assessment systems.

# Do LLMs Reliably Identify Correct Information Units in Aphasic Discourse?
Source: [https://arxiv.org/abs/2606.15696](https://arxiv.org/abs/2606.15696)
[View PDF](https://arxiv.org/pdf/2606.15696)

> Abstract: Correct Information Units (CIUs) are central to discourse assessment in aphasia because they quantify communicative informativeness rather than linguistic form alone. However, CIU scoring is time intensive and requires trained raters. This study examined whether instruction-tuned large language models (LLMs) can reliably perform token-level CIU classification from aphasic discourse transcripts. Sixteen picture-description transcripts elicited with the Cat Rescue stimulus were annotated for CIU status according to Nicholas and Brookshire (1993). The sample spanned four severity strata: control, mild, moderate, and severe aphasia. Four publicly available instruction-tuned LLMs were benchmarked under zero-shot and two few-shot prompting conditions across five stratified random seeds. Performance was evaluated against consensus human labels using accuracy, precision, recall, F1, and Cohen's kappa. Zero-shot prompting was insufficient across models. In contrast, few-shot prompting yielded substantial gains and produced competitive performance for three viable models. Mean few-shot F1 scores ranged from 0.776 to 0.817 across Llama-3.1-8B, Qwen2.5-7B, and Mistral-7B, with no significant differences between fixed global and per-chunk local example selection. Phi-3-mini was unstable and did not yield reliable performance. Viable models showed high recall but lower precision, suggesting systematic over-classification of tokens as CIUs. Performance also varied by discourse severity, with the weakest results in more severe aphasia. Few-shot LLM prompting can support automated CIU identification without gradient-based task training, but agreement with human annotation remains insufficient for fully autonomous use. These findings support LLM-based CIU scoring as a promising human-in-the-loop component of discourse assessment systems.
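
The abstract evaluates token-level predictions against consensus human labels using accuracy, precision, recall, F1, and Cohen's kappa. The following is a minimal sketch of how those five metrics can be computed for binary CIU labels. It is not the authors' evaluation code: the function name `binary_ciu_metrics`, the 1/0 label encoding, and the toy label sequences are all assumptions made for illustration, and the numbers it produces are not results from the study.

```python
def binary_ciu_metrics(gold, pred):
    """Token-level metrics for binary labels (assumed encoding: 1 = CIU, 0 = non-CIU).

    `gold` holds the human consensus labels and `pred` the model labels,
    aligned token by token. Hypothetical helper, for illustration only.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")

    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)
    n = len(pairs)

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Cohen's kappa: observed agreement corrected for the agreement
    # expected by chance from each rater's marginal label frequencies.
    p_observed = accuracy
    p_expected = (((tp + fn) / n) * ((tp + fp) / n)
                  + ((tn + fp) / n) * ((tn + fn) / n))
    kappa = ((p_observed - p_expected) / (1 - p_expected)
             if p_expected != 1 else 0.0)

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}


if __name__ == "__main__":
    # Toy example (made up): the "model" labels two extra tokens as CIUs.
    gold = [1, 1, 0, 0, 1, 0, 1, 0]
    pred = [1, 1, 1, 0, 1, 1, 1, 0]
    for name, value in binary_ciu_metrics(gold, pred).items():
        print(f"{name}: {value:.3f}")
```

In the toy example the predictions never miss a gold CIU but add two false positives, so recall is perfect while precision drops. This is the same qualitative pattern (high recall, lower precision) that the abstract describes as over-classification of tokens as CIUs. Kappa is also noticeably lower than F1 here, which illustrates why chance-corrected agreement can look weaker than F1 on the same predictions.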

## Submission history

From: Jason Pittman [[view email](https://arxiv.org/show-email/a226f98c/2606.15696)] **[v1]** Fri, 10 Apr 2026 01:53:35 UTC (686 KB)

Similar Articles

Are you speaking my languages? On spoken language adherence in multimodal LLMs

arXiv cs.CL

This paper addresses the problem of spoken language adherence in multimodal LLMs for ASR, proposing a soft prompting approach and novel metric to quantify language violations. It evaluates three mitigation strategies—zero-shot prompting, supervised fine-tuning, and chain-of-thought reasoning—across multiple languages to improve transcription fidelity.

Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment

arXiv cs.CL

This paper investigates whether standard benchmarks underestimate LLM performance by re-evaluating hallucination detection datasets using an LLM-first, human-adjudicated assessment method. The study finds that incorporating LLM reasoning into the adjudication process improves agreement and suggests that model-assisted re-evaluation yields more reliable benchmarks for ambiguity-prone tasks.

Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings

arXiv cs.CL

This paper evaluates the abilities of large language models (LLMs) and multimodal LLMs for addressee detection, turn-change prediction, and next speaker prediction in multi-party meeting conversations. Results show text-based LLMs outperform supervised models and humans in next speaker prediction, while multimodal LLMs improve over text-only models in other tasks but remain below human performance.

Don't let the LLM speak, just probe it (8 minute read)

TLDR AI

The article introduces a technique that extracts hidden states from an LLM at the last prompt token to perform classification without text generation, using a small MLP to read the model's internal decision, enabling fast and cheap zero-shot classifiers.