Automatic Extraction of Structured Information from Brain MRI Reports Using an Open-Weight Large Language Model

arXiv cs.AI Papers

Summary

This paper evaluates the open-weight LLM LLaMA 3.1 for automatic extraction of structured data from Dutch brain MRI reports, achieving high performance on visual rating scores and accurate detection of findings, with few-shot prompting improving extraction of numerical variables.

arXiv:2606.07721v1 (announce type: new)

Abstract

Objectives: Automatic data extraction from free-text radiology reports enables large-scale research, but few studies have assessed the performance of large language models (LLMs) on Dutch neuroradiology reports.

Methods: We analyzed 947 brain MRI reports from a tertiary memory clinic (2016-2021), authored by consultant neuroradiologists. Trained medical students annotated thirty variables; 100 reports were double-annotated to assess inter-rater reliability. We evaluated the performance of the open-weight LLM LLaMA 3.1 using different languages (Dutch vs. English translation) and few-shot prompting with different example selection strategies. Performance was evaluated using balanced accuracy for categorical variables, accuracy and mean absolute error for counts, and text similarity for free-text variables. Metrics were computed across 10 random splits of the 947 reports.

Results: LLaMA 3.1 demonstrated high zero-shot performance for visual rating scores (mean [95% CI]): Medial Temporal Atrophy, 90% [77-100%] on the left and 96% [94-99%] on the right; Global Cortical Atrophy, 87% [83-91%]; and Fazekas, 94% [93-96%]. Microbleed mentions were detected with 93% accuracy [92-95%] and infarct mentions with 82% [80-84%]. Text similarity for lesion location reached 0.95 [0.95-0.96]. Performance was lower for numerical variables: 80% [78-82%] for the number of microbleeds and 66% [63-68%] for the number of infarcts. English translation yielded comparable results. Few-shot prompting improved performance for numerical variables, achieving 92% [90-93%] for microbleeds and 81% [77-85%] for infarcts using structural similarity-based selection.

Conclusion: LLaMA 3.1 shows strong potential for extracting data from Dutch neuroradiology reports. Few-shot prompting enhances performance for numerical variables, whereas challenges remain for location-specific variables.
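The evaluation metrics named in the abstract (balanced accuracy for categorical variables, mean absolute error for counts, and a mean with a 95% interval across splits) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names and the toy labels are hypothetical, and the normal-approximation interval is an assumption, since the abstract does not say how the 95% CI was computed.

```python
from statistics import mean, stdev


def balanced_accuracy(y_true, y_pred):
    """Average of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return mean(recalls)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between annotated and extracted counts."""
    return mean([abs(t - p) for t, p in zip(y_true, y_pred)])


def mean_with_interval(values, z=1.96):
    """Mean and normal-approximation 95% interval (assumed method)."""
    m = mean(values)
    half_width = z * stdev(values) / len(values) ** 0.5
    return m, m - half_width, m + half_width


# Hypothetical toy data, not taken from the paper.
fazekas_true = [0, 1, 1, 2, 3, 3]
fazekas_pred = [0, 1, 2, 2, 3, 3]
microbleeds_true = [0, 2, 5, 1]
microbleeds_pred = [0, 3, 5, 0]

ba = balanced_accuracy(fazekas_true, fazekas_pred)
mae = mean_absolute_error(microbleeds_true, microbleeds_pred)
split_mean, ci_low, ci_high = mean_with_interval([0.90, 0.92, 0.94])
print(ba, mae, split_mean, ci_low, ci_high)
```

Balanced accuracy averages recall per class, so a rare score such as a high Fazekas grade weighs as much as a common one; that is the usual reason to prefer it over plain accuracy on imbalanced rating scales.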

Cached at: 06/09/26, 08:52 AM

# Automatic Extraction of Structured Information from Brain MRI Reports Using an Open-Weight Large Language Model
Source: [https://arxiv.org/abs/2606.07721](https://arxiv.org/abs/2606.07721)

Authors: [Kaouther Mouheb](https://arxiv.org/search/cs?searchtype=author&query=Mouheb,+K), [Amos Pomp](https://arxiv.org/search/cs?searchtype=author&query=Pomp,+A), [Antoine Manenti](https://arxiv.org/search/cs?searchtype=author&query=Manenti,+A), [Romy de Haan](https://arxiv.org/search/cs?searchtype=author&query=de+Haan,+R), [Farog Faghir](https://arxiv.org/search/cs?searchtype=author&query=Faghir,+F), [Joy Martens](https://arxiv.org/search/cs?searchtype=author&query=Martens,+J), [Harro Seelaar](https://arxiv.org/search/cs?searchtype=author&query=Seelaar,+H), [Francesco Mattace-Raso](https://arxiv.org/search/cs?searchtype=author&query=Mattace-Raso,+F), [Meike W. Vernooij](https://arxiv.org/search/cs?searchtype=author&query=Vernooij,+M+W), [Frank J. Wolters](https://arxiv.org/search/cs?searchtype=author&query=Wolters,+F+J), [Stefan Klein](https://arxiv.org/search/cs?searchtype=author&query=Klein,+S), [Esther E. Bron](https://arxiv.org/search/cs?searchtype=author&query=Bron,+E+E)

[View PDF](https://arxiv.org/pdf/2606.07721)


## Submission history

From: Kaouther Mouheb

**[v1]** Fri, 5 Jun 2026 15:57:35 UTC (6,056 KB)

Similar Articles

Brain-CLIPLM: Decoding Compressed Semantic Representations in EEG for Language Reconstruction

arXiv cs.CL

Researchers propose Brain-CLIPLM, a two-stage EEG-to-text decoding framework using contrastive learning for semantic anchor extraction and a retrieval-grounded LLM with Chain-of-Thought reasoning, achieving 67.55% top-5 sentence retrieval accuracy and suggesting EEG-to-text decoding should focus on recovering compressed semantic content rather than full sentence reconstruction.