AI-assisted Protocol Information Extraction for Improved Accuracy and Efficiency in Clinical Trial Workflows
Summary
Researchers from Banting Health AI present an AI system using generative LLMs with Retrieval-Augmented Generation (RAG) for automated clinical trial protocol information extraction, achieving 89.0% accuracy compared to 62.6% for standalone LLMs, with AI-assisted workflows completing tasks at least 40% faster and reducing cognitive demand.
# AI-assisted protocol information extraction for improved accuracy and efficiency in clinical trial workflows

Source: https://arxiv.org/html/2602.00052

Ramtin Babaeipour, François Charest, Madison Wright
Banting Health AI (https://bantinghealth.ai/), 357 Bay St., Toronto, ON, M5H 4A6, Canada

###### Abstract

Increasing clinical trial protocol complexity, amendments, and challenges around knowledge management create significant burden for trial teams. Structuring protocol content into standard formats has the potential to improve efficiency, support documentation quality, and strengthen compliance. We evaluate an Artificial Intelligence (AI) system using generative LLMs with Retrieval-Augmented Generation (RAG) for automated clinical trial protocol information extraction. We compare the extraction accuracy of our clinical-trial-specific RAG process against that of publicly available (standalone) LLMs. We also assess the operational impact of AI assistance on simulated Clinical Research Coordinator (CRC) extraction workflows. Our RAG process shows higher extraction accuracy (89.0%) than standalone LLMs with fine-tuned prompts (62.6%) against expert-supported reference annotations. In simulated extraction workflows, AI-assisted tasks are completed ≥40% faster, are rated as less cognitively demanding, and are strongly preferred by users. While expert oversight remains essential, this suggests that AI-assisted extraction can enable protocol intelligence at scale, motivating the integration of similar methodologies into real-world clinical workflows to further validate their impact on feasibility, study start-up, and post-activation monitoring.

Keywords: Clinical trials, CRC workflows, Protocols, Information extraction, Schedule of Events, RAG, LLM

## 1 Introduction

When properly planned and executed, clinical trials are known to be the best experimental method to evaluate the effectiveness and safety of a medical intervention. A clinical trial protocol constitutes a written agreement between investigators, research teams, participants, and the scientific community that assists communication by providing the trial's background, objectives and details about its design and organization [friedman2015]. It therefore contains foundational information that teams must extract and interpret in order to ensure consistent and compliant execution. However, as protocol complexity increases [jones2013, varse2019, getz2018], the completeness of protocol documents and their adherence to quality guidelines vary [gryaznov2022], and time-consuming, avoidable amendments have become more frequent [getz2024]. In this context, structured data extraction from protocol documents and referencing can be time-intensive and prone to inconsistencies [datta2024, kramer2025], even though it has the potential to improve downstream efficiency, support documentation quality, facilitate the ethical review process, and strengthen compliance [kargren2023, georgieff2023, fda2023]. Ultimately, gains in protocol quality and review efficiency would reduce burden and delays while improving evidence generation, transparency and translation to better healthcare [chan2025].

Traditionally, protocol structuring, understanding and operationalization rely on expert-driven review, iterative cross-functional clarification, and manual abstraction of key elements (e.g. endpoints, interventions, eligibility, safety, visit schedules) into spreadsheets and downstream systems (e.g. CTMS, IRB platforms, EDC).
Given its highly manual execution, this process is time-consuming and introduces avoidable variability, incompleteness and inaccuracies through repeated transcription, fragmented handoffs and amendments. Comprehensive technological solutions for the extraction and mapping of unstructured protocol data to downstream systems remain nascent, with specialized tools addressing only fragmented parts of the workflow (e.g. isolated scripts, vendor-specific modules), while general-purpose ad-hoc solutions (e.g. conversational AI) typically lack integration, repeatability, performance and compliance.

However, LLMs are AI systems capable of distilling complex, unstructured information into key data elements and summaries. They can serve as an assistive layer in established workflows, generating standardized first versions of data and documents that can be submitted for expert verification. By reducing time spent on routine structuring and document navigation, such systems can help teams focus on higher-leverage activities (e.g. adjudication of ambiguous cases, quality oversight) while improving the consistency and auditability of protocol-derived data. Beyond base information extraction, they provide automated reasoning and content generation capabilities. The latter have the potential to accelerate protocol document standardization through automated document authoring (e.g. [maleki2024]), but their more immediate application lies in extracting structured information from existing unstructured protocol documents and providing initial automated analyses. While [babaeipour2026] will go beyond information extraction, evaluating the RAG methodology for producing protocol complexity estimations under publicly available frameworks (e.g. scores and rationales over lists of complexity domains), this first paper focuses on information extraction.

We design, implement and evaluate a novel clinical-trial-specific RAG system with a Schedule of Events (SoE)-specific methodology for extensive automated protocol information extraction. It combines domain-specific RAG for text-based representations with specialized vision-based methods for tabular SoE data, enabling comprehensive extraction across diverse protocol structures. This addresses the foundational limitations of both standalone LLMs and general-purpose extraction approaches, extending the reach and scale of extraction procedures found in the current literature. We empirically compare our RAG approach against standalone LLMs across 23 publicly available protocols spanning multiple therapeutic areas, demonstrating improved accuracy, particularly for complex, scattered information. Through a controlled experiment with 13 CRCs, we assess information extraction accuracy as well as real-world operational impact, measuring time savings, cognitive load, and user preferences. In order to achieve a robust evaluation methodology, we developed an LLM-assisted evaluation and annotation adjudication framework that enables scalable, consistent assessment across hundreds of semi-structured data fields (see Table 1 (https://arxiv.org/html/2602.00052#S1.T1)).

Table 1: Statement of significance

## 2 Related Work

The current literature shows LLMs being used to extract certain study design features from study documents, such as eligibility criteria [datta2024, liu2021] and the study schedule of events [kramer2025, snorkel2022]. Recent examples feature the extraction of more general semi-structured information from unstructured oncology medical records [wiest2025].
This paper presents a more comprehensive extraction approach.

When extracting information from unstructured documents, a direct approach involves prompting an LLM with detailed instructions and providing entire documents as part of its context (standalone LLM). While recent LLMs allow for very large context windows, this approach has theoretical limitations. Among them:

- **Context window limits**: these models can still only work with limited document lengths, and protocol documents may exceed their input token limits [hosseini2024];
- **Context window spread**: they may not consistently identify and extract all relevant information, especially when details are scattered across different sections of lengthy protocols [liu2024];
- **Query number tradeoff**: on one hand, using a small number of prompts to extract hundreds of independent, individual data elements may lead to suboptimal performance, while on the other hand, using many prompts with such large contexts increases cost and time to completion [lewis2020];
- **Lack of element-specific context and referencing**: each data element extracted may require specialized context and prompting, knowledge of clinical research terminology, and output requirements [rajpurkar2022]. Practitioners may also need to reference specific protocol sections for auditability and traceability, which standalone LLMs may not provide in a natural way.

Retrieval-Augmented Generation (RAG) [lewis2020] addresses these limitations by combining the general knowledge encapsulated in LLMs with element-specific information retrieval queries, context and generation prompts.

Since the Schedule of Events (SoE) defines the study timing and procedures on which numerous downstream processes depend, it is crucial for operational execution [jscdm2025]. Clinical trial protocols very frequently use table formatting to represent information, most notably the SoE. Furthermore, SoE tables often involve multi-page spans, intricate cell merging, and hierarchical visit structures encoded through visual layout. Current PDF extraction methods often struggle with these particular challenges, from markup conversion [ferres2018] (cumbersome and labor-intensive) and image-based recognition [zhong2020] (the dominant approach, but heavily dependent on trained models) to direct text and metadata extraction (hard to generalize). Because SoE formatting varies widely across protocols, traditional rule-based or metadata-dependent approaches fail to reliably capture these hierarchical relationships and are effectively ruled out. Instead, we address SoE extraction through a specialized two-stage approach involving table detection on protocol pages followed by vision-based multimodal generation on those pages for information extraction. This approach proves sufficient to achieve performance levels on SoE similar to those on other information categories.

Another challenge requiring specific consideration is the quality evaluation of semi-structured data outputs. Recent work has demonstrated the effectiveness of using LLMs as evaluators, often termed "LLM-as-a-judge", for assessing the quality of AI-generated content. A comprehensive survey shows that in many settings LLM-based evaluation correlates well with human judgments [gu2024]. In healthcare specifically, it has been demonstrated that GPT-4o can effectively automate the evaluation of AI-generated clinical text, achieving strong agreement with expert clinicians while significantly reducing evaluation time and cost [croxford2025].
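To make this concrete, the sketch below shows one way an LLM judge can score a single extracted field against a reference annotation. It is a minimal illustration, not the paper's actual evaluation framework: the model choice, prompt wording, and three-level verdict scheme are our assumptions.

```python
# Minimal sketch of an "LLM-as-a-judge" check for one extracted field.
# The model, prompt, and verdict scheme are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating clinical trial protocol extraction.
Field: {field}
Reference annotation: {reference}
Extracted value: {extracted}

Judge whether the extracted value conveys the same information as the
reference, ignoring phrasing differences. Reply with JSON:
{{"verdict": "match" | "partial" | "miss", "rationale": "<one sentence>"}}"""

def judge_field(field: str, extracted: str, reference: str) -> dict:
    """Ask a judge LLM whether an extracted field matches the reference."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            field=field, reference=reference, extracted=extracted)}],
    )
    return json.loads(response.choices[0].message.content)

# Example: compare an extracted masking field against the reference annotation.
print(judge_field("masking", "Double-blind", "Double (Participant, Investigator)"))
```

Because verdicts are phrased as semantic equivalence judgments rather than string matches, a scheme like this tolerates the phrasing variation that makes exact-match scoring unreliable for semi-structured fields.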
LLM-based evaluation is particularly valuable in clinical research, where human expert evaluation is resource-intensive yet maintaining quality standards is critical. Building on this foundation, we employ an LLM-based evaluation framework to assess the accuracy and completeness of protocol abstractions in our study.

Beyond evaluation being sensitive to variations in output phrasing, the ground truth itself is partly subjective, as experts sometimes differ on what constitutes necessary and sufficient extracted information. Moreover, given the document density and breadth of the annotation task, even expert reviewers are prone to incomplete data capture. This means that a small group of expert annotators cannot realistically guarantee a truly exhaustive ground truth. While increasing the number of human reviewers could improve completeness, such a redundant process is prohibitively labor-intensive and fundamentally unscalable. In many recent comparable works ([yuan2025], [wang2024], [thomas2024]), researchers introduce LLMs into their annotation process, both for generating final adjudication candidates and for adjudicating those outputs. Similarly, we designed an LLM-assisted annotation process in which hybrid human–AI annotations are passed to an independent LLM-based adjudication layer, whose output is human-reviewed for low-confidence cases and quality-controlled on a randomly sampled subset.

## 3 Methods

### 3.1 Protocol documents selection

Starting from all studies listed on clinicaltrials.gov [ctgov] at the end of March 2025 that have an identifiable protocol document, we select interventional drug studies with treatment as the primary purpose, conducted in Canada or the United States, from which we randomly sample 23 studies: 9 from oncology, 7 from cardiovascular, and 7 from other therapeutic areas (see Listing 1 (https://arxiv.org/html/2602.00052#LST1) for details). For those studies, a human data annotation expert manually creates a semi-structured dataset from the data models described in Section 3.2 (https://arxiv.org/html/2602.00052#S3.SS2).

### 3.2 Extracted semi-structured data models

In order to structure and evaluate the information extraction accuracy of different methodologies, we define a set of semi-structured data models representing key data elements to be extracted from clinical trial protocols. These data models are designed to capture essential protocol information in a standardized, JSON-representable format that simplifies data handling, comparison and evaluation. We classify the information to be extracted into six broad categories: general information, inclusion/exclusion criteria, adverse event definitions, intervention, site requirements and schedule of events. Each category is subdivided into smaller data elements that represent the relevant (semi-)structured output to be extracted by the RAG invocations and also serve as a basis for performance evaluation. The list of those data elements is provided in the following subsections and Table 7 (https://arxiv.org/html/2602.00052#A1.T7).

#### 3.2.1 General information

Elements from this category include general study information (e.g. NCT ID, protocol version, title, sponsor, phase, therapeutic area, disease/condition, allocation, masking, estimated duration) as well as primary and secondary objectives and endpoints (see Listing 3 (https://arxiv.org/html/2602.00052#LST3) for an example).
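As a rough illustration of what such a JSON-representable model can look like, the sketch below defines a hypothetical general-information schema. The field names follow the elements listed above, but the paper's actual schema is the one shown in its Listing 3.

```python
# Illustrative sketch of a JSON-representable "general information" data
# model. Field names are hypothetical; the paper's schema is in Listing 3.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class GeneralInformation:
    nct_id: str
    protocol_version: str
    title: str
    sponsor: str
    phase: str
    therapeutic_area: str
    condition: str
    allocation: str
    masking: str
    estimated_duration: str
    primary_objectives: list[str] = field(default_factory=list)
    primary_endpoints: list[str] = field(default_factory=list)
    secondary_objectives: list[str] = field(default_factory=list)
    secondary_endpoints: list[str] = field(default_factory=list)

# Serializing to JSON keeps extracted outputs easy to compare and evaluate.
example = GeneralInformation(
    nct_id="NCT00000000", protocol_version="v2.0",
    title="A Phase 3 Study of an Investigational Product",
    sponsor="Example Sponsor", phase="Phase 3",
    therapeutic_area="Oncology", condition="Example condition",
    allocation="Randomized", masking="Double",
    estimated_duration="24 months",
)
print(json.dumps(asdict(example), indent=2))
```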
#### 3.2.2 Interventions

Elements from this category include trial arms (name, description) and intervention details (type, name, dosage, schedule) as well as treatment-level information (product name, dose, administration, restrictions, modifications) (see Listing 2 (https://arxiv.org/html/2602.00052#LST2) for an example).

#### 3.2.3 Schedule of events

Elements from this category include visit number, visit time, and procedures to be performed at each visit (see Listing 4 (https://arxiv.org/html/2602.00052#LST4) for an example).

#### 3.2.4 Inclusion/exclusion criteria

Elements from this category include inclusion criteria and exclusion criteria (see Listing 5 (https://arxiv.org/html/2602.00052#LST5) for an example).

#### 3.2.5 Adverse event definitions

Elements from this category include adverse event (AE) and serious adverse event (SAE) definitions, severity grading, relationship to study treatment, reporting requirements (timeframes, data collection, contacts), safety monitoring and management (plan, discontinuation criteria, emergency procedures), and specific AE information (expected AEs, potential risks, concomitant medication restrictions, special population considerations) (see Listing 6 (https://arxiv.org/html/2602.00052#LST6) for an example).

#### 3.2.6 Site requirements

Elements from this category include site equipment, certifications, and sample handling requirements as well as investigational product (IP) storage conditions (see Listing 7 (https://arxiv.org/html/2602.00052#LST7) for an example).

### 3.3 Data extraction approaches

We use the following three data extraction approaches.

#### 3.3.1 RAG extraction

Our clinical-trial-specific RAG process works in three key steps, illustrated in Figure 1 (https://arxiv.org/html/2602.00052).
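As an illustration of element-specific RAG extraction of the kind described above, the sketch below retrieves the protocol chunks most relevant to a single data element and prompts an LLM to emit that element as JSON with a source excerpt for traceability. The chunking, embedding model, prompt wording, and retrieval depth are illustrative assumptions, not the pipeline evaluated in this paper.

```python
# Rough sketch of element-specific RAG extraction: one focused retrieval
# and generation call per data element. Details here are assumptions.
import json
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of text chunks with an embedding model."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def extract_element(element: str, query: str, chunks: list[str], k: int = 5) -> dict:
    """Retrieve the k protocol chunks most relevant to one data element,
    then ask the LLM to emit that element as JSON with a source excerpt."""
    chunk_vecs = embed(chunks)
    query_vec = embed([query])[0]
    # Cosine similarity between the element-specific query and each chunk.
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec))
    context = "\n---\n".join(chunks[i] for i in np.argsort(sims)[-k:])
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        temperature=0,
        messages=[{"role": "user", "content":
            f"From the protocol excerpts below, extract '{element}' as JSON "
            f"with fields 'value' and 'source_excerpt'.\n\n{context}"}],
    )
    return json.loads(resp.choices[0].message.content)

# Usage, assuming the protocol PDF has already been split into text chunks:
# extract_element("masking", "study blinding and masking design", protocol_chunks)
```

Issuing one small retrieval-plus-generation call per element keeps each prompt short and element-specific, which is the tradeoff RAG makes against the single long-context standalone-LLM query discussed in Section 2.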