Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
Summary
This paper introduces Checkup2Action, a multimodal dataset and benchmark for generating patient-oriented action cards from clinical check-up reports, addressing the interpretability gap for laypersons.
# Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
Source: [https://arxiv.org/html/2605.11533](https://arxiv.org/html/2605.11533)
Sike Xiang \(sike\.xiang@durham\.ac\.uk\) \[1\]; Shuang Chen \(shuang\.chen@durham\.ac\.uk\) \[1\]; Kevin Qinghong Lin \(kevin\.qh\.lin@gmail\.com\) \[2\]; Jialin Yu \(yu\.jialin@outlook\.com\) \[2\]; Yijia Sun \(yijia\.sun@durham\.ac\.uk\) \[1\]; Philip Torr \(philip\.torr@eng\.ox\.ac\.uk\) \[2\]; Amir Atapour\-Abarghouei \(amir\.atapour\-abarghouei@durham\.ac\.uk\) \[1\]
\[1\] Durham University, Durham, UK; \[2\] University of Oxford, Oxford, UK
###### Abstract
Clinical check\-up reports are multimodal documents that combine page layouts, tables, numerical biomarkers, abnormality flags, imaging findings, and domain\-specific terminology\. Such heterogeneous evidence is difficult for laypersons to interpret and translate into concrete follow\-up actions\. Although large language models show promise in medical summarisation and triage support, their ability to generate safe, prioritised, and patient\-oriented actions from multimodal check\-up reports remains under\-benchmarked\. We present Checkup2Action, a multimodal clinical check\-up report dataset and benchmark for structured Action Card generation\. Each card describes one clinically relevant issue and specifies its priority, recommended department, follow\-up time window, patient\-facing explanation, and questions for clinicians, while avoiding diagnostic or treatment\-prescriptive claims\. The dataset contains 2,000 de\-identified real\-world check\-up reports covering demographic information, physical examinations, laboratory tests, cardiovascular assessments, imaging\-related evidence, and physician summaries\. We formulate checkup\-to\-action generation as a constrained structured generation task and introduce an evaluation protocol covering issue coverage and precision, priority consistency, department and time recommendation accuracy, action complexity, usefulness, readability, and safety compliance\. Experiments with general\-purpose and medical large language models reveal clear trade\-offs between issue coverage, action correctness, conciseness, and safety alignment\. Checkup2Action provides a new multimodal benchmark for evaluating patient\-oriented reasoning over clinical check\-up reports\.
## 1 Introduction
Routine clinical check\-ups generate multimodal reports that combine visual document layouts, tabular laboratory results, numerical biomarkers, abnormality flags, specialised symbols, imaging\-related findings, and free\-text physician comments\. Unlike single\-section clinical notes, these reports are often multi\-page documents in which evidence is distributed across heterogeneous regions, including tables, structured examination blocks, scanned pages, and imaging summaries\. For laypersons, such dense and visually structured clinical artefacts are difficult to interpret: abnormal values, reference ranges, arrows, and templated medical phrases must be read together before a meaningful follow\-up decision can be made\. This creates a substantial “interpretability gap” between multimodal check\-up evidence and concrete patient action\[gap1,VanDerMee2024LabResultsFormats,Petrovskaya2023PortalTestResultsScopingReview,gap3\]\.
Figure 1: Real\-world motivation for Checkup2Action\. Patients often receive multimodal clinical check\-up reports containing structured measurements, abnormality flags, imaging\-related findings, and physician comments, but may struggle to decide what to do next\. Checkup2Action converts such reports into prioritised, patient\-facing Action Cards that support appropriate follow\-up consultations and concrete next steps\.
In real\-world practice, patients often read check\-up reports by scanning visually salient cues, such as “abnormal” labels, ↑/↓ symbols, positive test indicators, highlighted reference\-range violations, or concluding phrases in imaging and laboratory sections\. However, these cues are distributed across multimodal report regions and do not by themselves indicate which findings are clinically important, which deviations are mild, or what follow\-up action should be taken\[abnormal\]\. Prior work shows that people with limited health literacy struggle to interpret heterogeneous clinical evidence, including laboratory values, radiology conclusions, and phrases such as “correlate clinically” or “mildly elevated”, which can lead either to disproportionate anxiety around isolated cues or to missed high\-risk findings that require intervention\[i5,gap3\]\. Even when reports contain an “overall conclusion” or “summary opinion”, these sections are usually descriptive rather than action\-oriented, leaving patients uncertain about priority, department referral, or follow\-up timing and often dependent on further clarification\[i7\]\. This motivates a benchmark for converting multimodal check\-up reports into structured, patient\-facing next\-step plans\.
Recent advances in large language models and conversational AI have enabled clinical summarisation, patient\-facing report simplification, and triage assistance\[gap3,masanneck2024\_triage\_llms\]\. However, most existing systems still produce free\-form explanations or acuity labels rather than structured, patient\-oriented action plans\. They rarely evaluate whether a model can organise multimodal check\-up evidence into explicit priorities, recommended departments, follow\-up time windows, and concrete questions for clinicians\[bluethgen2025agenticsystemsradiologydesign\]\. On the evaluation side, prior work often relies on text similarity metrics or small\-scale expert ratings, leaving open whether systems consistently identify clinically relevant issues, rank them appropriately, and provide safe next\-step guidance for patients\[tam2024frameworkhumanevaluationlarge\]\.
In this context, we construct the C2A \(Checkup2Action\) benchmark dataset \(Section [3](https://arxiv.org/html/2605.11533#S3)\) for multimodal check\-up\-to\-action generation\. The dataset contains de\-identified real\-world check\-up reports and supports evaluation of whether a system can transform visually structured, multi\-section clinical evidence into prioritised patient\-facing Action Cards\. We further introduce an evaluation framework that combines structured metrics \(problem recall, priority accuracy, department accuracy, time accuracy, and action complexity bias\) with human\-centred ratings \(problem relevance, safety, usefulness, clarity, and tone\), yielding ten complementary metrics that capture both system performance and user\-perceived quality\.
Built on this benchmark, we instantiate Checkup2Action \(Section [4](https://arxiv.org/html/2605.11533#S4)\) as a constrained baseline workflow for generating structured “Action Cards”\. Each card focuses on a single issue and specifies its priority, recommended department, suggested follow\-up time window, patient\-facing explanation, and questions to ask a clinician\. Figure [1](https://arxiv.org/html/2605.11533#S1.F1) illustrates the real\-world workflow: after a check\-up, the patient receives a report, and Checkup2Action converts it into action cards that help the patient prepare appropriate next steps\. We deliberately constrain the system’s scope to interpretation and action planning, organising existing findings into follow\-up recommendations without issuing new diagnostic labels or medication plans\.
Our primary contributions are thus as follows:
1. We introduce C2A, a real\-world multimodal clinical check\-up report dataset and benchmark containing 2,000 de\-identified reports with expert annotations for patient\-oriented Action Card generation\.
2. We formulate checkup\-to\-action generation as a structured multimodal report understanding task and provide an evaluation protocol that jointly measures issue coverage and precision, prioritisation consistency, department and follow\-up time recommendation quality, action complexity, usefulness, readability, and safety compliance\.
3. We instantiate Checkup2Action, a constrained baseline workflow that converts multi\-section check\-up evidence into ordered, patient\-facing Action Cards while avoiding diagnostic and treatment\-prescriptive outputs\.
## 2 Related Work
We review related work on multimodal health check\-up report understanding and datasets \(Section [2\.1](https://arxiv.org/html/2605.11533#S2.SS1)\), followed by medical AI agents for clinical summarisation, patient\-facing communication, and triage support \(Section [2\.2](https://arxiv.org/html/2605.11533#S2.SS2)\)\.
### 2\.1 Health Check\-up Reports and Related Datasets
Routine health check\-ups typically include multiple examination types, such as vital signs, laboratory tests, functional tests, and imaging or ultrasound examinations\. Although check\-up packages vary across settings, they commonly centre on cardiometabolic risk indicators such as blood pressure, cholesterol, adiposity measures, and, where appropriate, glucose\-related testing\[Araujo2025PeriodicHealthExams,US\_Preventive\]\. Their reports are usually multimodal clinical documents: numeric tables, reference ranges, abnormality flags, structured examination blocks, imaging\-related summaries, and free\-text conclusions are arranged across visually distinct report sections\[VanDerMee2024LabResultsFormats\]\. In laboratory and imaging sections, structured templates and standardised terminology can improve documentation consistency, but they also create comprehension barriers for non\-professional users\[ESR2023StructuredReportingUpdate\]\. Patients and health care providers further report that access to test results through web portals often requires additional explanation and guidance to support appropriate follow\-up actions\[Petrovskaya2023PortalTestResultsScopingReview\]\. Simply presenting numerical results with reference ranges does not ensure interpretability, and the limitations of reference intervals can lead to confusion or misinterpretation\[Timbrell2024ReferenceIntervalLimitations\]\.
Existing resources have supported medical report understanding from several perspectives, including radiology report simplification for patient understanding\[yang\-etal\-2023\-data\] and paired medical image\-report datasets such as MIMIC\-CXR\[mimiccxr\]\. However, these resources primarily target simplification, descriptive generation, image\-report modelling, or general medical understanding\. They do not directly evaluate whether a system can convert real\-world multimodal check\-up reports into structured, prioritised, and patient\-facing next\-step plans\. In particular, there remains a lack of standardised datasets and benchmarks that jointly assess issue identification, priority ranking, department recommendation, follow\-up timing, output conciseness, and safety compliance in routine check\-up scenarios\.
### 2\.2 Medical AI Agents
Large language models have increasingly been used as agents that combine instruction following, reasoning, tool use, and external actions\. General agentic methods such as ReAct\[ReAct\] and Toolformer\[Toolformer\] study how models can interleave reasoning with actions or learn to use tools, while agent benchmarks and platforms such as OpenHands\[OpenHands\], SWE\-agent\[SWE\], Mind2Web\[Mind2Web\], and WebArena\[WebArena\] demonstrate the importance of reproducible evaluation in interactive environments\. These studies motivate agentic workflows, but they do not address the specific safety and evaluation requirements of patient\-facing clinical report interpretation\.
In the medical domain, large language models and agentic frameworks have been explored for question answering, clinical decision support, and medical documentation generation\[Wang2024\]\. A major line of work focuses on clinical summarisation, such as producing concise overviews from electronic health records or discharge summaries to support clinician review\[Bednarczyk2025\]\. Another line targets patient\-facing communication by rewriting technical medical documents into more accessible explanations while balancing readability and information preservation\[jamanetworkopen\]\. Closely related studies investigate triage and acuity assessment, comparing model performance with emergency medicine professionals or proposing multi\-agent systems for clinical triage\[masanneck2024\_triage\_llms,lu\-etal\-2024\-triageagent\]\. However, most existing systems are designed for clinicians, institutions, or acute triage settings, and rarely evaluate whether multimodal check\-up reports can be converted into concrete, prioritised next steps for lay users\. Checkup2Action addresses this gap by providing a dedicated multimodal dataset and benchmark for structured patient\-facing Action Card generation\.
## 3 Datasets and Benchmark
### 3\.1 C2A Dataset
In this study, we build C2A \(Checkup2Action\), a real\-world multimodal dataset for check\-up\-to\-action generation\. The current version contains 2,000 de\-identified full check\-up reports in PDF format\. Each report is a multi\-page clinical document containing heterogeneous evidence from individual examination items, including visual layouts, structured tables, numerical measurements, abnormality flags, embedded images, imaging\-related findings, and physician\-written conclusions, as summarised in Figure [2](https://arxiv.org/html/2605.11533#S3.F2)\. The reports follow standardised check\-up documentation workflows used in medical institutions and cover a broad range of abnormal findings and risk indicators\. The publicly released version will provide a standardised English edition of the dataset\. \(Footnote 1: Source code and dataset will be publicly available after the review period\.\)
Each report covers five major categories: Basic Information, including demographic attributes and vital signs such as age, sex, height, weight, and blood pressure; General Physical Examination, covering routine clinical assessment findings such as vision and physical signs; Laboratory Tests, including standard haematology and biochemical panels such as routine blood counts, liver function, and renal function tests; Imaging Examinations, covering radiological and sonographic findings such as chest X\-ray and abdominal ultrasound; and Cardiovascular Tests, including cardiac and vascular assessments such as electrocardiography\.
For benchmark construction, physician\-written report summaries are used to derive reference issues and action\-card attributes for evaluation\. We treat each complete report as a single sample rather than splitting it into isolated tests, preserving the document\-level context and cross\-section evidence aggregation required in real check\-up interpretation\. To avoid evaluating only local abnormality extraction, the benchmark requires systems to organise findings across sections into patient\-facing issues, priorities, department recommendations, follow\-up timing, and safe explanatory content\.
We parsed and summarised the check\-up report PDFs to characterise the dataset\. Report length ranges from 8–25 pages \(mean 15\.08\), and per\-report text ranges from 2,301–10,223 characters \(mean 5,379\.5\), corresponding to 342–2,107 words \(mean 1,107\.4\)\. Each report contains 11–27 embedded images \(mean 17\.12\)\. At the reference level, the number of issues per report ranges from 2–28 \(mean 7\.2\), and the priority distribution is High 7\.4%, Medium 20\.8%, and Low 71\.8%\. We will release more detailed statistics and distribution tables alongside the public dataset so that readers can assess scale, difficulty, and class imbalance\.
Figure 2: Overview of the C2A dataset\. The dataset is built from real\-world multimodal clinical check\-up reports and covers diverse examination information, including demographic and vital\-sign records, physical examination findings, laboratory tests, imaging examinations, and cardiovascular assessments\.
### 3\.2 Benchmark and Evaluation Metrics
Building on the C2A dataset and reference annotations derived from physician\-written summaries, we construct the Checkup2Action evaluation benchmark\. The benchmark contains two complementary components: structured consistency and subjective quality\. The structured component assesses whether generated Action Cards align with clinical references in terms of issue coverage, priority, recommended department, and follow\-up timing\. The subjective component evaluates whether the cards are useful, safe, readable, and appropriately worded for lay users under realistic check\-up scenarios\.
#### 3\.2\.1 Structured Consistency
At the issue level, we match each reference issue to the most semantically similar model\-generated issue using cosine similarity over sentence embeddings\. A reference issue is considered covered if the similarity exceeds a fixed threshold of $\tau = 0.8$, motivated by prior clinical NLP evaluations that use high cosine similarity as an acceptable semantic\-match criterion\[KADHIM2026100895,bioengineering12111194\]\. The proportion of covered reference issues defines Problem Recall, measuring whether the model surfaces the health issues that should be brought to the user’s attention\. We also derive issue precision and combined scores to analyse over\-generation, where models may produce additional cards that are not supported by the reference\.
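The matching rule above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: embeddings are passed in as plain lists of floats (the sentence encoder is left unspecified, as in the text), the function name `issue_scores` is hypothetical, and the precision definition (a generated issue counts as supported if it matches some reference issue above the same threshold) is an assumption, since the text only states that precision is derived from the same matching procedure.

```python
import math

TAU = 0.8  # similarity threshold stated in the text


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def issue_scores(ref_embs, gen_embs, tau=TAU):
    """Return (recall, precision, f1) for one report."""
    # A reference issue is covered if its best generated match exceeds tau.
    covered = sum(
        1 for r in ref_embs if any(cosine(r, g) > tau for g in gen_embs)
    )
    # Assumption: a generated issue is supported if it matches any reference.
    supported = sum(
        1 for g in gen_embs if any(cosine(g, r) > tau for r in ref_embs)
    )
    recall = covered / len(ref_embs) if ref_embs else 0.0
    precision = supported / len(gen_embs) if gen_embs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return recall, precision, f1


# Toy 3-dimensional "embeddings" standing in for real sentence embeddings.
refs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
gens = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
print(issue_scores(refs, gens))
```

In this toy case one of two reference issues is covered and one of two generated issues is supported, so recall, precision, and F1 all come out at one half. Whether scores are pooled over all issues or averaged per report is not stated in the text; the sketch computes them per report.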
For covered issues, we further evaluate the structured attributes attached to each card\. Priority Accuracy measures alignment between the predicted urgency level and the reference, with partial credit for near\-miss predictions\. Department Accuracy measures whether the recommended clinical department matches the reference\. Time Accuracy measures whether the suggested consultation or follow\-up time window matches the reference category, using ordered labels such as immediate, as soon as possible, near\-term, and routine follow\-up\. Finally, Action Complexity Bias measures the deviation between the number of generated cards and reference cards for each report, averaged across samples\. This metric indicates whether a model tends to over\-generate or under\-generate action cards, which is important for controlling output length and information burden in patient\-facing use\.
#### 3\.2\.2 Subjective Quality
Beyond structured metrics, Checkup2Action includes five 1–5 subjective dimensions for evaluating the overall quality of generated Action Cards\. Problem relevance assesses whether the cards are tightly linked to abnormalities or potential risks in the check\-up report rather than generic advice\. Safety evaluates whether the model stays within its intended role of explaining findings and suggesting follow\-up actions without making diagnoses or inappropriate treatment recommendations\. Usefulness measures whether a layperson can understand what to do next and whether the suggested actions are practically executable\. Clarity examines whether the language is well organised, fluent, and makes key information visible\. Tone assesses whether the wording is professional and reassuring without being harsh, exaggerated, or anxiety\-provoking\. In practice, each generated report\-level output is scored along these five dimensions and averaged within each dimension\. Together with the structured metrics, these ratings provide a complementary assessment of patient\-facing quality and safety\.
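A small sketch may help make two of these metrics concrete. Several details here are assumptions rather than the paper's definitions: the size of the partial credit (0.5 for a prediction one level away), the use of the four schema priority labels from Section 4 as the ordered scale, and the choice of a raw signed count difference for the bias (the text only says "deviation", though the reported negative values suggest a signed quantity).

```python
# Ordered from least to most urgent; labels follow the card schema in Section 4.
PRIORITY_LEVELS = ["note", "long-term focus", "medium", "high"]


def priority_score(pred, ref, near_miss_credit=0.5):
    """Full credit for an exact match, partial credit for an adjacent level."""
    gap = abs(PRIORITY_LEVELS.index(pred) - PRIORITY_LEVELS.index(ref))
    if gap == 0:
        return 1.0
    if gap == 1:
        return near_miss_credit  # assumed value, not given in the text
    return 0.0


def priority_accuracy(pairs):
    """Mean priority score over (predicted, reference) pairs of covered issues."""
    return sum(priority_score(p, r) for p, r in pairs) / len(pairs) if pairs else 0.0


def action_complexity_bias(gen_counts, ref_counts):
    """Mean signed difference between generated and reference card counts.

    Negative values indicate under-generation, positive values over-generation.
    """
    diffs = [g - r for g, r in zip(gen_counts, ref_counts)]
    return sum(diffs) / len(diffs) if diffs else 0.0


pairs = [("high", "high"), ("medium", "high"), ("note", "high")]
print(priority_accuracy(pairs))                # exact, near miss, far miss
print(action_complexity_bias([3, 5], [5, 5]))  # two reports
```

Department and time accuracy would follow the same pattern as exact-match rates over covered issues.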
## 4 Checkup2Action
Figure 3: Overview of the Checkup2Action pipeline\. A multimodal check\-up report is first ingested and parsed, then organised into structured clinical sections covering basic information, physical examinations, laboratory tests, cardiovascular tests, and imaging examinations\. The segmented evidence is used to generate prioritised patient\-facing Action Cards with recommended department, follow\-up timing, and questions for clinicians\.
The Checkup2Action workflow is shown in Figure [3](https://arxiv.org/html/2605.11533#S4.F3)\. Given a multimodal check\-up report, the system outputs an ordered sequence of structured Action Cards\. Rather than using a single end\-to\-end prompt, the workflow separates report ingestion, multimodal section segmentation, schema\-constrained generation, and post\-generation verification\. This decomposition is important for real\-world check\-up reports, which are often multi\-page PDF documents containing visual layouts, free text, numerical tables, reference intervals, abnormality flags, physician comments, ECG traces, and imaging\-related evidence\.
At the pipeline level, the system follows three steps\. Step 1 parses the input report and converts it into a cleaned linearised representation that preserves clinically relevant content while reducing layout noise\. Step 2 organises the parsed content into structured multimodal sections, such as basic information, physical examination, laboratory tests, cardiovascular tests, and imaging examinations, so that evidence scattered across pages and modalities can be grouped into coherent clinical units\. Under the role constraints and output schema described below, Step 3 performs schema\-constrained generation to produce JSON\-formatted Action Cards, followed by schema validation, priority\-based ordering, cross\-section consistency checking, and safety review\.
### 4\.1 System Pipeline
#### 4\.1\.1 Step 1: Report Ingestion
Given a check\-up report document $d \in \mathcal{D}$, we parse and normalise it into a linearised context:
$$x = \mathcal{L}(d), \tag{1}$$
where $\mathcal{L}(\cdot)$ denotes a deterministic extraction, cleaning, and normalisation procedure\. For PDF reports, we perform page\-level parsing and layout extraction to recover heterogeneous report components, including text blocks, table\-like regions, abnormality markers, and image objects\. The recovered content is then cleaned and linearised into a unified representation while preserving key clinical fields such as item names, measured values, units, reference ranges, abnormal flags, physician comments, and imaging\-related findings\. Document artefacts such as headers, footers, duplicated page fragments, and irrelevant decorative elements are removed to reduce downstream noise\.
Optionally, a short user\-provided summary $u$ can be incorporated by concatenation:
$$x' = [x\,;\,u]. \tag{2}$$
When no additional summary is provided, we set $x' = x$\. The output of this stage serves as the global clinical context for subsequent segmentation and generation\.
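To illustrate the cleaning and concatenation steps, the sketch below assumes page text has already been extracted from the PDF by some parser (not shown) and implements one simple, deterministic heuristic: lines that recur on most pages are treated as headers or footers and dropped. The function names, the repeat-ratio threshold, and the sample page text are all hypothetical; the paper does not specify how artefacts are detected, and a real system would need to protect clinical lines that legitimately repeat.

```python
import math
from collections import Counter


def linearise(pages, min_repeat_ratio=0.6):
    """Join page texts into one string, dropping blank lines and lines that
    repeat on most pages (a crude proxy for headers and footers)."""
    page_lines = [
        [ln.strip() for ln in page.splitlines() if ln.strip()] for page in pages
    ]
    # Count on how many distinct pages each line occurs.
    page_counts = Counter(ln for lines in page_lines for ln in set(lines))
    threshold = max(2, math.ceil(min_repeat_ratio * len(pages)))
    repeated = {ln for ln, n in page_counts.items() if n >= threshold}
    kept = [ln for lines in page_lines for ln in lines if ln not in repeated]
    return "\n".join(kept)


def build_context(x, user_summary=None):
    """x' = [x ; u] when a summary u is given, otherwise x' = x."""
    return x if not user_summary else f"{x}\n{user_summary}"


# Made-up page text for illustration only.
pages = [
    "Example Clinic Report\nALT 65 U/L (ref 9-50) H\nConfidential",
    "Example Clinic Report\nChest X-ray: small nodule noted\nConfidential",
]
x = linearise(pages)
x_prime = build_context(x, "Patient asks about liver results")
print(x_prime)
```

Because the procedure has no randomness, the same report always yields the same linearised context, which matches the deterministic role of the extraction operator in Eq. (1).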
#### 4\.1\.2 Step 2: Multimodal Medical Data Segmentation
We then organise the linearised content into $m$ structured sections:
$$S = \mathcal{S}(x') = \{s_1, s_2, \dots, s_m\}, \qquad s_j = (t_j, z_j), \tag{3}$$
where $t_j$ is the section type, such as basic information, physical examination, laboratory tests, cardiovascular tests, imaging examinations, or physician summary, and $z_j$ is the corresponding text span or extracted content\. Equivalently, $\mathcal{S}(\cdot)$ can be viewed as predicting a set of section boundaries:
$$B = \{(b_j, e_j)\}_{j=1}^{m}, \qquad z_j = x'[b_j : e_j]. \tag{4}$$
Rather than treating segmentation as a purely free\-form generation problem, we adopt a hybrid strategy combining tool\-based parsing, rule\-based localisation, and LLM verification\. First, the parser provides candidate regions from the recovered page structure\. Second, rule\-based cues such as section headers, table titles, item templates, repeated report patterns, and known examination keywords are used to localise candidate boundaries\. Finally, the LLM verifies the available sections and their corresponding page or span ranges, resolving ambiguous cases and refining assignments when evidence from one clinical category is distributed across multiple pages or mixed with other content\.
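The rule-based localisation stage can be sketched as a header scan over the linearised text. This covers only the second of the three stages: parser-provided candidate regions and the LLM verification pass are omitted. The header patterns and the function name are hypothetical, chosen to mirror the section types listed above rather than taken from the paper.

```python
import re

# Hypothetical header patterns, one per section type named in the text.
SECTION_PATTERNS = {
    "basic_information": r"basic information",
    "physical_examination": r"(general )?physical examination",
    "laboratory_tests": r"laboratory tests?",
    "cardiovascular_tests": r"cardiovascular tests?",
    "imaging_examinations": r"imaging examinations?",
    "physician_summary": r"(physician )?summary",
}


def candidate_sections(x):
    """Return (section_type, begin, end) triples so that z_j = x[begin:end].

    A line counts as a section header only if the whole line matches a pattern.
    Each section runs from its header to the next header (or the end of x).
    """
    starts = []
    offset = 0
    for line in x.splitlines(keepends=True):
        text = line.strip().lower()
        for section_type, pattern in SECTION_PATTERNS.items():
            if re.fullmatch(pattern, text):
                starts.append((section_type, offset))
                break
        offset += len(line)
    ends = [begin for _, begin in starts[1:]] + [len(x)]
    return [(t, b, e) for (t, b), e in zip(starts, ends)]


x = "Basic Information\nAge 45\nLaboratory Tests\nALT 65 U/L\n"
for section_type, begin, end in candidate_sections(x):
    print(section_type, repr(x[begin:end]))
```

The returned offsets correspond to the boundary pairs in Eq. (4); in the hybrid strategy described above, an LLM would then confirm or adjust these candidates, for example when one category is split across pages.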
#### 4\.1\.3 Step 3: Action Card Generation
Given the segmented report representation $S$, the model generates a JSON\-formatted output string $\hat{y}$ under schema and behavioural constraints:
$$\hat{y} = \arg\max_{y \in \Omega} \; p_{\theta}(y \mid S), \tag{5}$$
where $p_{\theta}$ denotes the large language model and $\Omega$ is the set of valid outputs that satisfy the Action Card schema, field completeness requirements, allowed priority values, and safety constraints that prohibit diagnostic or treatment\-prescriptive claims\. The model output is then parsed into an ordered sequence of Action Cards:
$$\mathcal{C} = \{c_1, c_2, \dots, c_n\}.$$
Each Action Card $c_i$ is a structured object describing one health issue that may require attention, together with a corresponding next\-step plan\. We model each card as a 6\-tuple:
$$c_i = \bigl(f^{(i)}_1, f^{(i)}_2, \dots, f^{(i)}_6\bigr),$$
where $f^{(i)}_1$ to $f^{(i)}_6$ correspond to `problem`, `priority`, `why`, `department`, `time_suggestion`, and `questions`, respectively\. Here, `problem` summarises the focal issue, `priority` takes values in \{“high”, “medium”, “long\-term focus”, “note”\}, `why` provides a short patient\-facing explanation of why the issue deserves attention, `department` specifies a recommended clinical department, `time_suggestion` describes the suggested follow\-up window, and `questions` lists questions that the patient may bring to the consultation\. This fixed schema keeps the output actionable and comparable across models while limiting redundancy and consultation overload\.
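One way to picture the 6-tuple is as a small typed record. The sketch below uses the six field names and the four priority values given in the text; the Python types, the validation rules, and the sample card content are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass

# Allowed priority values as listed in the text.
ALLOWED_PRIORITIES = ("high", "medium", "long-term focus", "note")


@dataclass
class ActionCard:
    problem: str          # focal issue
    priority: str         # one of ALLOWED_PRIORITIES
    why: str              # short patient-facing explanation
    department: str       # recommended clinical department
    time_suggestion: str  # suggested follow-up window
    questions: list       # questions to bring to the consultation

    def __post_init__(self):
        if self.priority not in ALLOWED_PRIORITIES:
            raise ValueError(f"unknown priority: {self.priority!r}")
        if not all(isinstance(q, str) for q in self.questions):
            raise ValueError("questions must be a list of strings")


# Made-up example content for illustration only.
card = ActionCard(
    problem="Liver enzyme (ALT) above the reference range",
    priority="medium",
    why="The value is higher than the range printed on the report.",
    department="Internal medicine",
    time_suggestion="near-term",
    questions=["Should this test be repeated?", "What could affect this value?"],
)
print(card.priority, len(card.questions))
```

Rejecting unknown priority values at construction time is one simple way to keep generated cards inside the set of valid outputs described by Eq. (5).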
### 4\.2 Behavioural Constraints and Schema
We constrain the large language model to act as an “action translator” for check\-up results rather than as a diagnostic or treatment decision maker\. Through prompts, schema constraints, and post\-generation checks, the model is only allowed to extract, reorganise, and explain information related to next\-step actions from existing examination findings, abnormality cues, and physician\-written comments\. The generated cards should answer what to do next, when to do it, and which department to consult\.
The system is explicitly prohibited from producing definitive diagnostic statements, concrete treatment plans, or drug regimens\. For potentially high\-risk findings, it must use conservative wording such as “seek medical care as soon as possible” or “follow the current doctor’s advice”\. These constraints position Checkup2Action as an information reorganisation and action\-planning workflow, reducing the risk of over\-diagnosis while preserving practical usefulness for patients\.
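A post-generation check of this kind could, in its simplest form, be a phrase screen over the generated card text. The sketch below is only an illustration of that idea: the patterns and the function name are hypothetical, the paper does not describe its checks at this level of detail, and a keyword screen will produce both false positives and false negatives, so it would at most flag cards for review.

```python
import re

# Hypothetical patterns for wording the constraints are meant to exclude.
FLAG_PATTERNS = [
    r"\byou have\b",               # definitive diagnostic phrasing
    r"\bhighly suspicious\b",      # high-suspicion diagnostic phrasing
    r"\b\d+(?:\.\d+)?\s*mg\b",     # concrete drug dosage
]


def safety_flags(card_text):
    """Return the patterns that match the card text (empty list if none)."""
    return [
        p for p in FLAG_PATTERNS if re.search(p, card_text, flags=re.IGNORECASE)
    ]


unsafe = "Highly suspicious early-stage lung cancer."
safe = "A small nodule was noted; please arrange follow-up imaging."
print(safety_flags(unsafe))
print(safety_flags(safe))
```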
### 4\.3 Output Parsing and Ordering
A post\-processing module extracts a valid JSON fragment from the raw model output, checks the structural completeness of the `cards` list and required fields, and sorts the cards according to a predefined priority mapping, for example, “high” \> “medium” \> “long\-term focus” \> “note”\. This module does not repair or alter the semantic content of the model output; it only parses, validates, and reorders the generated cards\. The resulting ordered `cards` are then rendered into a user\-facing Action Card view for presentation\. This design ensures that benchmark comparisons across models are based on a clean and controllable output format\.
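A minimal sketch of such a module is shown below. It assumes the model returns a JSON object with a top-level `cards` list, possibly surrounded by extra text; the brace-scanning extraction, the function names, and the error handling are illustrative choices rather than the authors' implementation. Consistent with the description above, it validates and reorders but never edits card content.

```python
import json

# Priority mapping from the text: high > medium > long-term focus > note.
PRIORITY_RANK = {"high": 0, "medium": 1, "long-term focus": 2, "note": 3}
REQUIRED_FIELDS = (
    "problem", "priority", "why", "department", "time_suggestion", "questions",
)


def extract_json(raw):
    """Parse the span from the first '{' to the last '}' in the raw output."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(raw[start:end + 1])


def parse_and_order(raw):
    """Validate the cards list and return it sorted by priority."""
    cards = extract_json(raw).get("cards")
    if not isinstance(cards, list):
        raise ValueError("output must contain a 'cards' list")
    for card in cards:
        missing = [f for f in REQUIRED_FIELDS if f not in card]
        if missing:
            raise ValueError(f"card is missing fields: {missing}")
        if card["priority"] not in PRIORITY_RANK:
            raise ValueError(f"unknown priority: {card['priority']!r}")
    # sorted() is stable, so cards of equal priority keep their generated order.
    return sorted(cards, key=lambda c: PRIORITY_RANK[c["priority"]])


def _demo_card(priority):
    return {"problem": "p", "priority": priority, "why": "w",
            "department": "d", "time_suggestion": "t", "questions": ["q"]}


raw = "Here are the cards: " + json.dumps(
    {"cards": [_demo_card("note"), _demo_card("high"), _demo_card("medium")]}
)
print([c["priority"] for c in parse_and_order(raw)])
```

Raising on invalid output instead of patching it keeps the comparison across models honest: a model that emits malformed cards is penalised rather than silently corrected.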
### 4\.4 Task Boundary and Clinical Positioning of Action Cards
Action Cards are designed as a patient\-support layer with clearly defined clinical boundaries\. Their purpose is not to generate new diagnoses or treatment plans, but to reorganise abnormal findings, risk signals, physician comments, and follow\-up cues already present in check\-up reports into structured next\-step guidance\. In other words, Action Cards are not intended to answer “What disease does the patient have?”, but rather “Given the existing check\-up results, what should be prioritised next, within what time frame, which department should be consulted, and which questions should be raised?”
Under this framing, Checkup2Action is best understood as a constrained mechanism for multimodal information reorganisation and action translation, rather than as formal clinical decision support or a substitute for diagnosis\. Ultimate medical judgement and decision\-making remain the responsibility of clinicians\.
## 5 Experiments
Table 1: Structured performance on the Checkup2Action benchmark\. Prob\. rec\. = problem recall, Prior\. acc\. = priority accuracy, Dept\. acc\. = department accuracy, Time acc\. = time accuracy, and Act\. compl\. bias = action complexity bias\. Arrows indicate metric direction: $\uparrow$ means higher is better, and $|\cdot|\downarrow$ means a smaller absolute value is better\. Values in bold, underline, and italics denote the best, second\-best, and third\-best results in each column, respectively\. Subscripts denote standard deviations\.
Table [1](https://arxiv.org/html/2605.11533#S5.T1) summarises structured performance on the Checkup2Action benchmark\. The task is deliberately constrained: generated Action Cards should describe abnormalities, risk signals, and recommended next steps without producing definitive disease diagnoses\. In contrast, reference issues are derived from physician\-written summaries and may contain more explicit diagnostic formulations or high\-suspicion statements\. Problem recall should therefore be interpreted together with safety and output\-complexity metrics, since a model that avoids over\-diagnosis may use more conservative wording and cover fewer diagnostically phrased reference issues\.
### 5\.1 Quantitative Results
#### 5\.1\.1 Structured Metrics
Across models, Gemini\-3\-pro\-preview achieves the highest problem recall \(0\.602\), suggesting that it surfaces the largest proportion of reference issues\. GPT\-5\.1 obtains lower recall \(0\.527\) but performs best or second best on priority accuracy, department accuracy, time accuracy, and action complexity bias\. This indicates a more balanced trade\-off between issue coverage, downstream action correctness, and output length control\. In comparison, Gemini\-3\-pro\-preview favours broader issue coverage but is weaker on priority assignment and action\-card count calibration\. GPT\-5\-nano remains close to GPT\-5\.1 on priority accuracy but trails on recall and other structured attributes\. The Claude variants show comparatively conservative behaviour: Claude\-haiku\-4\.5 achieves the best priority accuracy but has the lowest recall and the largest negative action complexity bias, indicating substantial under\-generation\. Gemini\-2\.5\-flash and Qwen3\-VL\-8B\-Instruct occupy the middle range on most metrics, while Grok\-4\.1\-fast performs less consistently on department and time recommendation\. Overall, the benchmark disentangles multiple capabilities required for checkup\-to\-action generation: issue identification, priority ranking, department recommendation, follow\-up timing, and output complexity control\.
Problem precision and Problem F1 can also be computed from the same issue\-matching procedure\. We do not include them in the main table to keep the primary comparison focused and because precision and F1 are sensitive to under\-generation and over\-generation\. For example, a conservative model may obtain high precision by producing very few cards while missing clinically relevant issues, whereas an over\-generating model may improve recall while reducing precision\. In our results, GPT\-5\.1 achieves approximately 0\.662 precision and 0\.614 F1, while Claude\-haiku\-4\.5 achieves higher precision \(0\.847\) but substantially lower F1 \(0\.371\), reflecting a pronounced under\-generation trade\-off\. Since Action Cards are ranked by priority, future versions of the benchmark may also include rank\-aware retrieval metrics\.
#### 5\.1\.2 Subjective Evaluation
Figure [4](https://arxiv.org/html/2605.11533#S5.F4) summarises the subjective evaluation of generated Action Cards across five patient\-facing quality dimensions: problem relevance, safety, usefulness, clarity, and tone\. Overall, model outputs receive relatively high scores, mostly in the 3\.8–4\.9 range, but clear differences remain\. GPT\-5\.1 shows the strongest overall subjective profile, with consistently high scores across all dimensions and leading on problem relevance, safety, usefulness, and tone\. Gemini\-3\-pro\-preview achieves the highest clarity score and remains competitive overall, while Gemini\-2\.5\-flash also performs strongly on tone and safety\. The Claude variants tend to produce safe but more conservative and information\-sparse outputs, consistent with their lower recall and negative action complexity bias in Table [1](https://arxiv.org/html/2605.11533#S5.T1)\. Qwen3\-VL\-8B\-Instruct trails behind on most subjective dimensions\. These trends show that subjective quality is not determined by issue coverage alone: models with higher recall do not necessarily provide the most useful or best calibrated patient\-facing action recommendations\.
Figure 4: Subjective evaluation on the Checkup2Action benchmark across problem relevance, safety, usefulness, clarity, and tone\. Higher scores indicate better patient\-facing quality on all axes\.
### 5\.2 Ablation Study
We further design a series of ablation experiments to verify the contribution of the key design modules in Checkup2Action\.
#### 5\.2\.1 Safety Constraints
One core ablation focuses on the safety constraint module: we compare a setting in which the model is explicitly instructed to “avoid making concrete diagnoses and not use disease labels” with one in which these safety instructions are removed entirely\. The results show that, without safety constraints, Problem Recall increases slightly\. However, the subjective Safety and Problem Relevance scores both drop noticeably, and the model is more likely to produce diagnostic statements such as “Highly suspicious early\-stage lung cancer\.” Fig\. [5](https://arxiv.org/html/2605.11533#S5.F5) visualises two versions of the generated action cards for the same check\-up scenario\. Fig\. [5](https://arxiv.org/html/2605.11533#S5.F5)\(a\) corresponds to the version without safety constraints, which implicitly suggests “early\-stage lung cancer” and steers the patient toward thinking about staging and treatment\. Fig\. [5](https://arxiv.org/html/2605.11533#S5.F5)\(b\) corresponds to the safety\-constrained version, which only describes objective findings such as “small nodule \+ mildly elevated CEA \+ family history,” explicitly emphasises that the current results are insufficient for a diagnostic conclusion, and guides the patient toward follow\-up imaging and specialist consultation instead\. The two outputs share the same input and differ only in whether the safety module is enabled, which illustrates the role that safety constraints play in controlling risk in model outputs\.

\(a\) Example without safety constraints\.

\(b\) Example with safety constraints\.
Figure 5: Comparison between an unconstrained and a safety\-constrained Checkup2Action output\.
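The safety-constraint ablation can be sketched as toggling a single instruction in the prompt. This is only an illustrative sketch, assuming the constraint is applied as a prompt instruction. The safety wording is paraphrased from the quoted instruction in this section, while `BASE_INSTRUCTION`, `build_prompt`, and the sample report text are hypothetical and do not reproduce the actual Checkup2Action prompts.

```python
# Paraphrased from the instruction quoted in the ablation description.
SAFETY_INSTRUCTION = "Avoid making concrete diagnoses and do not use disease labels."

# Hypothetical placeholder, not the prompt used in the paper.
BASE_INSTRUCTION = "Read the check-up report and produce Action Cards for the patient."


def build_prompt(report_text: str, use_safety_constraints: bool) -> str:
    """Assemble a prompt; the two ablation settings differ only in the safety line."""
    parts = [BASE_INSTRUCTION]
    if use_safety_constraints:
        parts.append(SAFETY_INSTRUCTION)
    parts.append("Report:\n" + report_text)
    return "\n\n".join(parts)


# Hypothetical report snippet used only to show the two variants.
report = "Chest CT: small nodule. CEA: mildly elevated."
constrained_prompt = build_prompt(report, use_safety_constraints=True)
unconstrained_prompt = build_prompt(report, use_safety_constraints=False)
```

Keeping the report text and base instruction identical across the two variants matches the controlled comparison described above, where the outputs differ only in whether the safety module is enabled.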
#### 5\.2\.2 Safety Layer Ablation
Figure [6](https://arxiv.org/html/2605.11533#S5.F6) shows that removing safety filtering dramatically increases Problem Recall from 0\.527 to 0\.825, indicating that, without safety constraints, the model is much more willing to surface potential issues instead of withholding them\. Because the model becomes more willing to make firm judgements about specific findings, it can assign downstream attributes to a broader set of generated issues\. This behaviour leads to consistent but more moderate gains on other metrics, rather than the very large jump seen in Problem Recall: priority accuracy rises from 0\.719 to 0\.812 and department accuracy from 0\.844 to 0\.894\. In contrast, time recommendation accuracy decreases slightly from 0\.915 to 0\.881, suggesting that the safety layer may encourage more conservative timing suggestions that sometimes align better with reference answers in the check\-up setting\. Interestingly, the absolute action complexity bias decreases from 0\.447 to 0\.329, suggesting that the safety layer can push the model toward more elaborate, overly cautious action plans, whereas the unconstrained variant produces more concise recommendations\.
Figure 6: Structured metric comparison between the safety\-constrained and unconstrained settings\.
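To make the size of each change in the safety layer ablation easier to compare, the following sketch computes absolute and relative differences from the values quoted in the text. The dictionary keys and the `change` helper are illustrative names, not part of the benchmark code.

```python
# Values quoted in the ablation text: (with safety layer, without safety layer).
metrics = {
    "problem_recall": (0.527, 0.825),
    "priority_accuracy": (0.719, 0.812),
    "department_accuracy": (0.844, 0.894),
    "time_accuracy": (0.915, 0.881),
    "abs_action_complexity_bias": (0.447, 0.329),
}


def change(constrained: float, unconstrained: float) -> tuple[float, float]:
    """Return (absolute change, relative change in percent) when the layer is removed."""
    diff = unconstrained - constrained
    return round(diff, 3), round(100 * diff / constrained, 1)


deltas = {name: change(*pair) for name, pair in metrics.items()}

for name, (absolute, relative_pct) in deltas.items():
    print(f"{name:28s} {absolute:+.3f} ({relative_pct:+.1f}%)")
```

On these numbers, Problem Recall changes by +0.298 in absolute terms, which is several times larger than the shift in any other metric and matches the observation that the remaining gains are more moderate.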
#### 5\.2\.3 Medical Backbone Comparison
We further focus on the two strongest general\-purpose backbones in our benchmark, GPT\-5\.1 and Gemini\-3\-pro\-preview, and compare them with the medical\-domain model MedGemma\-27B, while keeping the Checkup2Action workflow, constraints, and evaluation protocol unchanged\. This comparison allows us to test whether switching from a general\-purpose backbone to a medical\-domain backbone improves performance\. It is worth noting that most publicly available fine\-tuned medical models were released relatively early and are not multimodal, so we choose MedGemma\-27B \[medgemma\], one of the most recent and largest medical backbones, as the representative model for this experiment\.
Figure 7: Structured metric comparison between GPT\-5\.1, Gemini\-3\-pro\-preview and MedGemma\-27B\.
As shown in Fig\. [7](https://arxiv.org/html/2605.11533#S5.F7), GPT\-5\.1, MedGemma, and Gemini\-3\-pro\-preview achieve comparable Problem Recall \(0\.527, 0\.530, and 0\.602\), indicating that all three can surface many clinically relevant issues\. However, MedGemma lags behind GPT\-5\.1 on downstream decisions \(priority accuracy 0\.689 vs\. 0\.719; department accuracy 0\.650 vs\. 0\.844; time accuracy 0\.791 vs\. 0\.915\), while Gemini\-3\-pro\-preview trades higher recall for lower priority accuracy \(0\.591\) and substantially higher action complexity bias \($|\mathrm{bias}| = 1.398$\)\.
This trend is consistent with the training objectives of these models\. MedGemma is pretrained on medical QA and clinical reasoning datasets \(e\.g\., MedQA \[MedQA\] and PubMedQA \[PubMedQA\]\) as well as large de\-identified medical imaging corpora, making it strong at professional diagnosis and image interpretation\. In contrast, GPT\-5\.1 and Gemini\-3\-pro\-preview are general\-purpose assistants trained with RLHF\-style alignment \[RLHF\] for instruction following, conversational safety, and layperson\-facing explanations, which may better match Checkup2Action’s “no diagnosis, only action guidance and safe communication” setting\.
## 6 Conclusion
This paper introduces C2A \(Checkup2Action\), a real\-world multimodal clinical check\-up report dataset and benchmark for patient\-facing Action Card generation\. By framing check\-up interpretation as a structured report\-to\-action task, C2A evaluates whether systems can organise heterogeneous evidence from multi\-page reports, including tables, numerical biomarkers, abnormality flags, imaging\-related findings, and physician comments, into prioritised next\-step guidance\. The benchmark jointly measures structured consistency and subjective patient\-facing quality, covering issue coverage, priority assignment, department recommendation, follow\-up timing, action complexity, usefulness, readability, and safety compliance\. Experiments across general\-purpose and medical\-domain large language models show that strong general\-purpose models can provide more balanced action recommendations under the same workflow, while simply switching to a medical\-domain backbone does not automatically improve performance\. The safety ablation further highlights an important trade\-off: removing constraints can increase issue coverage, but also raises the risk of over\-diagnostic and treatment\-oriented outputs\.
Looking forward, this work opens several directions for future research\. The current best\-performing systems rely on general\-purpose large language models rather than models specifically trained for check\-up\-oriented action guidance, so future work can develop dedicated models for multimodal check\-up\-to\-action generation as more annotated data become available\. C2A should also be extended across more diverse hospitals, countries, languages, report templates, terminology, units, and PDF conversion settings to test robustness under real\-world variation\. Beyond offline benchmarking, prospective user studies with lay participants and clinicians are needed to evaluate whether Action Cards improve patient comprehension, reduce anxiety, increase appropriate follow\-up adherence, and affect clinician workload in real workflows\. Overall, Checkup2Action provides a reusable multimodal benchmark and constrained baseline workflow for studying how clinical check\-up reports can be transformed into safe, structured, and actionable patient\-facing guidance\.
## References