Divide-Prompt-Refine: a Training-Free, Structure-Aware Framework for Biomedical Abstract Generation

arXiv cs.CL Papers

Summary

DPR-BAG is a training-free, zero-shot framework that generates coherent biomedical abstracts from full-text articles by decomposing them into rhetorical facets, summarizing each with an LLM, and refining for coherence, achieving better novelty than baselines while maintaining factual consistency.

arXiv:2605.20628v1 Announce Type: new Abstract: Biomedical abstracts play a critical role in downstream NLP applications, such as information retrieval, biocuration, and biomedical knowledge discovery. However, a non-trivial number of biomedical articles do not have abstracts, diminishing the utility of these articles for downstream tasks. We propose DPR-BAG (Divide, Prompt, and Refine for Biomedical Abstract Generation), a training-free, zero-shot framework that generates coherent and factually grounded abstracts for biomedical articles with full text but no abstract. DPR-BAG decomposes full-text documents into structured rhetorical facets following the Background-Objective-Methods-Results-Conclusions (BOMRC) schema, performs parallel LLM-based summarization for each facet, and applies a final refinement stage to restore global discourse coherence. On PMC-MAD, a distribution-aligned dataset of 46,309 biomedical articles, DPR-BAG improves abstractive novelty over strong extractive and fine-tuned baselines, while maintaining factual consistency. Our ablation study reveals a counterintuitive finding: increasing prompt complexity or explicitly injecting entity-level guidance can degrade factual alignment, highlighting the importance of controlled prompting strategies. These findings underscore the potential of training-free, structure-aware frameworks for scalable biomedical abstract generation in low-resource settings. Our data and code are available at https://huggingface.co/datasets/pmc-mad/PMC-MAD and https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/DPR-BAG.
Original Article
View Cached Full Text

Cached at: 05/21/26, 06:34 AM

# Divide-Prompt-Refine: a Training-Free, Structure-Aware Framework for Biomedical Abstract Generation
Source: [https://arxiv.org/html/2605.20628](https://arxiv.org/html/2605.20628)
Sylvey Lin1,Joe Menke1,Shufan Ming1,Dongin Nam1, Neil Smalheiser1,2,Halil Kilicoglu1,

1School of Information Sciences, University of Illinois Urbana\-Champaign, Champaign, IL 2Department of Psychiatry, University of Illinois College of Medicine, Chicago, IL Correspondence:[yuhsinl2@illinois\.edu](https://arxiv.org/html/2605.20628v1/mailto:[email protected])

###### Abstract

Biomedical abstracts play a critical role in downstream NLP applications, such as information retrieval, biocuration, and biomedical knowledge discovery\. However, a non\-trivial number of biomedical articles do not have abstracts, diminishing the utility of these articles for downstream tasks\. We propose DPR\-BAG \(Divide, Prompt, and Refine for Biomedical Abstract Generation\), a training\-free, zero\-shot framework that generates coherent and factually grounded abstracts for biomedical articles with full text but no abstract\. DPR\-BAG decomposes full\-text documents into structured rhetorical facets following the Background\-Objective\-Methods\-Results\-Conclusions \(BOMRC\) schema, performs parallel LLM\-based summarization for each facet, and applies a final refinement stage to restore global discourse coherence\. On PMC\-MAD, a distribution\-aligned dataset of 46,309 biomedical articles, DPR\-BAG improves abstractive novelty over strong extractive and fine\-tuned baselines, while maintaining factual consistency\. Our ablation study reveals a counterintuitive finding: increasing prompt complexity or explicitly injecting entity\-level guidance can degrade factual alignment, highlighting the importance of controlled prompting strategies\. These findings underscore the potential of training\-free, structure\-aware frameworks for scalable biomedical abstract generation in low\-resource settings\. Our data and code are available at[https://huggingface\.co/datasets/pmc\-mad/PMC\-MAD](https://huggingface.co/datasets/pmc-mad/PMC-MAD)and[https://github\.com/ScienceNLP\-Lab/MultiTagger\-v2/tree/main/DPR\-BAG](https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/DPR-BAG)\.

Divide\-Prompt\-Refine: a Training\-Free, Structure\-Aware Framework for Biomedical Abstract Generation

Sylvey Lin1, Joe Menke1, Shufan Ming1, Dongin Nam1,Neil Smalheiser1,2,Halil Kilicoglu1,1School of Information Sciences, University of Illinois Urbana\-Champaign, Champaign, IL2Department of Psychiatry, University of Illinois College of Medicine, Chicago, ILCorrespondence:[yuhsinl2@illinois\.edu](https://arxiv.org/html/2605.20628v1/mailto:[email protected])

## 1Introduction

Many biomedical NLP tasks rely heavily on abstracts, due to their accessibility and information density\. Abstracts provide an author\-written summary of core scientific findings, making them a useful proxy for full\-text articles in downstream applications\. For example,Luoet al\.\([2022](https://arxiv.org/html/2605.20628#bib.bib24)\)showed that pre\-training tasks designed around the title\-abstract structure improve biomedical information retrieval;Wiegerset al\.\([2025](https://arxiv.org/html/2605.20628#bib.bib26)\)used abstracts as an initial data source for biocuration\. Beyond content, the structured organization of abstracts also benefits downstream tasks:Uedaet al\.\([2021](https://arxiv.org/html/2605.20628#bib.bib25)\)leverage abstract\-level structure to refine retrieval; PubMedQAJinet al\.\([2019](https://arxiv.org/html/2605.20628#bib.bib28)\)is based on structured abstracts to support high\-fidelity biomedical knowledge discovery\. Moreover, abstracts alone can serve as a stronger training signal than full texts in some settingsGuet al\.\([2021](https://arxiv.org/html/2605.20628#bib.bib34)\)\.

However, the absence of abstracts in a significant portion of biomedical articles creates a bottleneck for these tasks\. As of April 2026, 11,603,796 out of 40,414,072 \(~29%\) PubMed articles were missing abstracts, with this volume continuing to rise despite a declining overall proportion, driven by the growth of publication types such as case reports, editorials, and letters\. These publication types carry substantial scientific value\. For instance,Gurulingappaet al\.\([2012](https://arxiv.org/html/2605.20628#bib.bib3)\)andFanet al\.\([2020](https://arxiv.org/html/2605.20628#bib.bib33)\)utilized case reports for adverse drug event detection;Magnet and Carnet \([2006](https://arxiv.org/html/2605.20628#bib.bib1)\)andNuzzo \([2021](https://arxiv.org/html/2605.20628#bib.bib35)\)assessed letters to characterize post\-publication scientific discourse, including patterns of critique, rhetorical features of disagreement, and trends in authorship; andWaaijeret al\.\([2011](https://arxiv.org/html/2605.20628#bib.bib36)\)andIoannidis and Schippers \([2025](https://arxiv.org/html/2605.20628#bib.bib2)\)analyzed editorials to study how journals shape scientific discourse, including the distribution of topics, the framing of policy issues, and the presence of systematic biases\.

The absence of abstracts in these articles motivates the Biomedical Abstract Generation \(BAG\) task, which aims to automatically generate abstracts from full\-text biomedical articles\. Although related to standard document summarization, BAG differs in important ways\. It must adhere to scientific reporting conventions, including structured presentation of methods, results, and conclusions, while preserving fine\-grained biomedical entities, quantitative findings, and explicit argumentative relationships that are often critical for scientific interpretation\. Early BAG work byChachraet al\.\([2016](https://arxiv.org/html/2605.20628#bib.bib37)\)utilized extractive sentence selection, which can lead to fragmented coherence and poor lexical flow\. Moreover, because full\-length biomedical articles often exceed the context limits of standard models, BAG is inherently a long\-context task, making it vulnerable to extractive bias and factual fidelity issues\. For example,Wanget al\.\([2025](https://arxiv.org/html/2605.20628#bib.bib13)\)demonstrates that even state\-of\-the\-art models like GPT\-4 suffer from hallucinations and information omission when extracting from non\-decomposed scientific full texts, emphasizing the inherent fidelity risks in long\-document processing required for BAG task\. Beyond fidelity risks, recent analysis also reveals that when forced to process complex long full texts, even specialized models like LongT5 exhibit a strong extractive bias, relying on simple heuristics to copy verbatim snippets rather than synthesizing informationChernyshev and Dobrov \([2024](https://arxiv.org/html/2605.20628#bib.bib14)\)\. As a result, these models can suffer from the same core issue as traditional extractive summarizers: they produce fragmented text that lacks the natural flow and cohesion of human\-written summariesGiareliset al\.\([2023](https://arxiv.org/html/2605.20628#bib.bib30)\)\.

To address these limitations, we propose the Divide, Prompt and Refine Biomedical Article Generation \(DPR\-BAG\) framework\. Drawing on prior work showing that divide\-and\-conquer decomposition reduces intermediate errors in LLMsZhanget al\.\([2025](https://arxiv.org/html/2605.20628#bib.bib11)\), DPR\-BAG decomposes full\-text articles along their rhetorical structure, performs parallel summarization on each resulting facet, and applies a modular refinement stage to reconcile fragmented outputs and restore discourse coherence\. We target six rhetorical facets: Background, Objectives, Methods, Results, Conclusions \(BOMRC\), and Others\. BOMRC is adopted as it represents the standard discourse structure validated in the PubMed 200k RCT datasetDernoncourt and Lee \([2017](https://arxiv.org/html/2605.20628#bib.bib44)\), while the “Others" facet retains any unclassified content\. Using this design, we focus on three research questions:

1. 1\.Can we develop a training\-free approach for the BAG task?
2. 2\.Does structure\-aware decomposition of full\-text articles improve the quality of generated abstracts compared to naive prompting?
3. 3\.To what extent does increasing prompting complexity \(from detailed instructions to entity guidance\) improve generation quality?

Our main contributions are as follows:

1. 1\.We propose DPR\-BAG, a training\-free, structure\-aware method for BAG\.
2. 2\.We release a dataset of more than 46K biomedical full\-text publications for BAG task\.
3. 3\.We compare DPR\-BAG to strong extractive and abstractive baselines\.
4. 4\.We systematically evaluate the effect of various prompting and splitting strategies as well as entity guidance within DPR\-BAG\.

![Refer to caption](https://arxiv.org/html/2605.20628v1/Abstract_Generation.jpg)Figure 1:Overview of the DPR\-BAG framework for biomedical abstract generation\.
## 2Dataset

We constructed a BAG dataset based on PubMed publications from 1987 to 2023\. To ensure a representative sample, we first calculated the publication type \(PT\) distribution of articles lacking abstracts using PT queries adapted from prior workMenkeet al\.\([2024](https://arxiv.org/html/2605.20628#bib.bib38)\)\. We then performed stratified sampling based on this distribution to retrieve 130,000 candidate XML files from the PubMed Central \(PMC\) Open Access subset, ensuring that the sampled articles reflect the publication type distribution of abstract\-less PubMed records\. For data processing, we adapted the extraction pipeline from the Long\-summarization frameworkCohanet al\.\([2018](https://arxiv.org/html/2605.20628#bib.bib23)\)to parse structured sections and abstracts from the raw XML files\. After filtering out records that were unparseable or lacked extractable abstracts, the final dataset, hereafter referred to as PMC\-MAD \(Missing\-Abstract Distribution\-aligned PMC\), consists of 46,309 articles\.

## 3Methods

DPR\-BAG follows a modular pipeline designed to generate structure\-aware abstracts from biomedical full\-text articles \(Figure[1](https://arxiv.org/html/2605.20628#S1.F1)\)\. The process begins by decomposing the document into five distinct facets based on BOMRC, plus an "Others" facet for unclassified content\. For each facet, we perform parallel LLM\-based summarization, which can optionally be augmented with entity guidance extension\. The resulting sectional summaries are then concatenated and passed to a final LLM\-based refinement stage to restore discourse coherence\. DPR\-BAG requires no task\-specific training or fine\-tuning; all components operate in a zero\-shot manner using pre\-trained, off\-the\-shelf models\.

### 3\.1Task Formulation

Given a full\-text biomedical articleD=\(p1,p2,…,pn\)D=\(p\_\{1\},p\_\{2\},\\ldots,p\_\{n\}\), where eachpip\_\{i\}denotes a paragraph, the goal is to generate an abstractAAthat covers a predefined set of rhetorical facetsℱ=\{\\mathcal\{F\}=\\\{Background, Objectives, Methods, Results, Conclusions, Others\}\\\}\(BOMRC\+\)\.

We reformulate this as a facet\-conditioned summarization problem, where the model generates content for each rhetorical facet separately\. This decomposition allows the model to address each rhetorical component independently and capture discourse structure\. Specifically, the document is partitioned intoK=6K=6facet\-specific sub\-documents\{Dfk\}k=1K\\\{D\_\{f\_\{k\}\}\\\}\_\{k=1\}^\{K\}, where eachDfkD\_\{f\_\{k\}\}aggregates paragraphs rhetorically aligned with facetfkf\_\{k\}\. Each sub\-document is independently summarized to produce a facet summarya^fk\\hat\{a\}\_\{f\_\{k\}\}, and the concatenationA^=⨁k=1Ka^fk\\hat\{A\}=\\bigoplus\_\{k=1\}^\{K\}\\hat\{a\}\_\{f\_\{k\}\}is subsequently refined into the final abstractR​\(A^\)=AR\(\\hat\{A\}\)=A\. Facets absent from the source document yield empty strings\.

### 3\.2Document Splitting

To segment full\-text documents into rhetorically coherent texts, we use LLM\-SSCLanet al\.\([2024](https://arxiv.org/html/2605.20628#bib.bib18)\), an LLM\-based sequential sentence classification framework that assigns rhetorical labels \(BOMRC\) to sentences using in\-context learning \(performance details in Appendix[E](https://arxiv.org/html/2605.20628#A5)\)\. While the model was trained on structured abstracts, we assume that the underlying rhetorical intent \(e\.g\., methodological description vs\. result reporting\) remains consistent within full\-text paragraphs\. Specifically, we leverage the role of the first sentence of the paragraph as a topic sentence that typically encapsulates the paragraph’s functional purpose\. We assign the label of the first sentence of each paragraph as the global label for that paragraph, subsequently concatenating all paragraphs with matching labels to form the input document facets\. We refer to this approach as theFirst Sentence Labeling \(FS\)strategy, and empirically validate it against naive splitting \(NS\) and section\-header \(SH\) ablation variants below\.

##### Naive Splitting Approach \(NS\):

This approach distributes paragraphs into six segments, aiming for an approximately even distribution while maintaining paragraph integrity\. Including this baseline allows us to assess whether a semantic\-aware division \(e\.g\., LLM\-SSC\) offers advantages over a purely structural, length\-based partition\.

##### Section\-header Normalization \(SH\):

This strategy serves as a coarse\-grained semantic baseline\. Utilizing the Transformer model developed inLinet al\.\([2025](https://arxiv.org/html/2605.20628#bib.bib17)\), this approach categorizes existing section headers into standard BOMRC categories and concatenates paragraphs within the same facet\. This comparison helps determine if the fine\-grained, sentence\-level classification used in LLM\-SSC provides additional utility beyond simple section\-level organization\.

### 3\.3Parallel Summarization

After the documents are divided into six document facets \(BOMRC and Others\), each is input into the LLM to generate a corresponding facet summary; facets that are not present in the source document are represented as empty summaries\. These summaries are subsequently concatenated to form the draft abstract\. The following subsections detail the prompting strategies and optional entity guidance extensions used during summarization\.

#### 3\.3\.1Prompting Strategies

To investigate the effect of prompt complexity on generation quality, we adopt aBasic Concise \(BC\)prompting strategy as the baseline, and ablate prompting complexity by evaluating two more elaborate variants,Detailed Instruction \(DI\)andStructural Instruction \(SI\)\. Full prompt templates are described in Appendix[A](https://arxiv.org/html/2605.20628#A1)\.

##### Basic Concise Prompting \(BC\):

BC is a minimal prompting strategy with coarse\-grained focal points for each rhetorical facet \(e\.g\., directing the model to “prioritize key findings and data” for the Results section\) without further elaboration or explicit formatting structure\.

##### Detailed Instruction Prompt \(DI\):

DI is a more detailed prompting strategy modeled after the abstract submission guidelines of JMIR Publications111[https://support\.jmir\.org/hc/en\-us/articles/37982552280987\-Submitting\-Your\-Manuscript\-to\-JMIR\-Publications\-A\-Guide\-for\-Authors](https://support.jmir.org/hc/en-us/articles/37982552280987-Submitting-Your-Manuscript-to-JMIR-Publications-A-Guide-for-Authors)whose five\-part BOMRC structured guideline aligns with the target rhetorical categories that DPR\-BAG uses\. By shifting the LLM persona to a “biomedical synthesis assistant,” this prompt aims to enforce the extraction of granular details and mandates the inclusion of specific research designs, sample sizes, response rates, and statistical metrics \(such as p\-values and confidence intervals\) to ensure adherence to multifaceted reporting standards\.

##### Structural Instruction Prompt \(SI\):

SI extends the basic prompt by introducing an explicit structural schema using Markdown formatting strategy, inspired byHeet al\.\([2024](https://arxiv.org/html/2605.20628#bib.bib39)\)\. Compared to DI, which focuses on detailed content guidance, SI organizes the prompt into a more structured format, which aims to improve instruction adherence\.

#### 3\.3\.2Entity Guidance

We additionally introduce an optional knowledge\-grounding extension to further enhance semantic fidelity during parallel summarization\. This component extracts key biomedical entities to guide the LLM in summarizing each facet\. We consider two instantiations of this extension, TR\-UMLS and CoT, detailed below\.

##### TextRank and UMLS normalization \(TR\-UMLS\):

We extract key phrases from each facet using TextRank, a graph\-based unsupervised method that requires no additional training and is well\-suited to our zero\-shot setting, and link them to UMLSBodenreider \([2004](https://arxiv.org/html/2605.20628#bib.bib46)\)concepts using scispaCy’s UMLS entity linkerNeumannet al\.\([2019](https://arxiv.org/html/2605.20628#bib.bib45)\), retaining only phrases with valid UMLS mappings\. When multiple phrases map to the same UMLS concept, they are grouped together and represented by a single term to avoid redundant anchoring\. The top\-nnentities, ranked by TextRank centrality, are incorporated into the summarization prompt as anchor terms to guide the model toward the most structurally significant medical information in the text\. We ablaten∈\{5,10\}n\\in\\\{5,10\\\}in Section[5\.2](https://arxiv.org/html/2605.20628#S5.SS2)\.

##### Chain\-of\-Thought \(CoT\):

As an alternative, we employ a two\-stage prompting strategy within the LLM summarization module when the extension is activated\. In the first stage, we prompt the model to list the important entities of the facet, and in the second stage, the model is prompted to synthesize the information based on the facet and the entities it listed in the first stage\.

We ablate these two entity guidance strategies independently\. TR\-UMLS pairs with DI, while CoT pairs with SI, whose structured format makes it well\-suited for CoT’s two\-stage reasoning\.

### 3\.4Validation and Fallback

To ensure the summarization pipeline robustness, we implement a validation and fallback mechanism\. If all six facet summaries are empty due to insufficient context or LLM output format violations, the facets are regrouped into three broader categories \(Intro, Main Idea, and Results & Conclusions\) and sent back to the parallel summarization module\. If regrouping fails, a first 300 characters \(Lead\-300\) heuristic backup is used to guarantee non\-empty output \(details in Appendix[B](https://arxiv.org/html/2605.20628#A2)\)\.

### 3\.5Refinement

Upon the formation of the draft abstract, we prompt the LLM to perform a global refinement to ensure structural coherence and stylistic consistency\. This final processing step synthesizes the concatenated facets into a unified Final Abstract \(the refinement prompt is detailed in Appendix[C](https://arxiv.org/html/2605.20628#A3)\)\.

Table 1:Performance comparison on primary evaluation dimensions\. For Abstractiveness metrics, parenthetical values indicate the difference from the original abstract; best = smallest absolute deviation\. Factuality scores are computed against the source full\-text; higher is better\. Best results bolded\. \(Bi\-g: Bigram novelty; Tri\-g: Trigram novelty; Dens\.: Density; AS: AlignScore; MC: MiniCheck\)Table 2:Performance comparison on semantic alignment and supporting metrics\. For Coverage and Compression, parenthetical values indicate the difference from the original abstract; best = smallest absolute deviation\. BS, SENT, R\-L, and U\-R: higher is better\. FOCUS: lower is better\. Best results bolded\. \(BS: BERTScore F1; SENT: DS\_SENT\_NN; FOCUS: DS\_FOCUS\_NN; R\-L: ROUGE\-L; U\-R: UMLS Recall; Cov\.: Coverage; Comp\.: Compression\)

## 4Experimental Setup

DPR\-BAG is implemented withLlama\-3\.2:3Bdeployed via Ollama in instruction\-tuned mode\. Hardware details, token\-limit constraints, and fine\-tuning hyperparameters are provided in Appendix[D](https://arxiv.org/html/2605.20628#A4)\.

### 4\.1Baseline Models

To establish robust baselines for our framework, we compared our approach against several standard long\-document summarization models\. We utilized two off\-the\-shelf variants of the Longformer Encoder\-Decoder \(LED\) architectureBeltagyet al\.\([2020](https://arxiv.org/html/2605.20628#bib.bib22)\)—pretrained on arXiv222[https://huggingface\.co/allenai/led\-large\-16384\-arxiv](https://huggingface.co/allenai/led-large-16384-arxiv)and PubMed333[https://huggingface\.co/patrickvonplaten/led\-large\-16384\-pubmed](https://huggingface.co/patrickvonplaten/led-large-16384-pubmed), respectively—to evaluate their zero\-shot transferability to our task\. Additionally, we included an off\-the\-shelf LongT5 modelGuoet al\.\([2022](https://arxiv.org/html/2605.20628#bib.bib32)\)pretrained on PubMed444[https://huggingface\.co/Stancld/longt5\-tglobal\-large\-16384\-pubmed\-3k\_steps](https://huggingface.co/Stancld/longt5-tglobal-large-16384-pubmed-3k_steps)to broaden our baseline comparisons across different architectures\. Finally, to ensure maximal adaptation to our corpus, we evaluated a supervised fine\-tuned version of the PubMed\-pretrained LED and LongT5 using our 80% training split, using the 10% validation split for early stopping\. All evaluations were performed on the remaining 10% \(test split\)\.

### 4\.2Evaluation Metrics

To assess the generated abstracts, we employ a multi\-dimensional suite of metrics, with particular emphasis on abstractiveness and factuality\. Detailed formulations and implementation details of each metric are provided in Appendix[I](https://arxiv.org/html/2605.20628#A9)\. Paired bootstrap significance tests for the main comparisons are provided in Appendix[J](https://arxiv.org/html/2605.20628#A10)\.

##### Abstractiveness:

Bigram and trigram noveltymeasure the proportion of tokens absent from the source text, serving as a proxy for abstractive synthesis\. We additionally adoptDensityfrom Newsroom\(Gruskyet al\.,[2018](https://arxiv.org/html/2605.20628#bib.bib20)\), which quantifies the length of verbatim copying from the source, offering a complementary view of the model’s extractive behavior\. For both, smaller absolute deviations from the human\-written reference indicate closer stylistic alignment\.

##### Factuality:

Given that our target abstracts are expected to be highly abstractive, factuality metrics must remain reliable under heavy paraphrasing\. We therefore adoptAlignScore\(Zhaet al\.,[2023](https://arxiv.org/html/2605.20628#bib.bib41)\)as our primary factuality measure, as it evaluates alignment against the source full\-text across a broad range of dimensions \(notably paraphrasing\), making it more robust than purely entailment\-based alternatives\. We additionally reportSummaC\(Labanet al\.,[2022](https://arxiv.org/html/2605.20628#bib.bib40)\)andMiniCheck\(Tanget al\.,[2024](https://arxiv.org/html/2605.20628#bib.bib42)\), both NLI\-based metrics, as cross\-checks\.

##### Semantic Alignment:

We adoptBERTScore\(Zhanget al\.,[2020](https://arxiv.org/html/2605.20628#bib.bib48)\)to evaluate semantic similarity between generated abstracts and the reference abstracts via contextual embeddings\. To further assess discourse coherence, we adoptDiscoScore\(Zhaoet al\.,[2023](https://arxiv.org/html/2605.20628#bib.bib19)\), reporting DS\_SENT\_NN \(sentence\-level structural alignment\) and DS\_FOCUS\_NN \(shared noun semantic alignment\)\.

##### Supporting Metrics:

We additionally adoptROUGE\-L\(Lin,[2004](https://arxiv.org/html/2605.20628#bib.bib47)\)to measure n\-gram overlap with the reference\.UMLS Recallquantifies the proportion of UMLS\(Bodenreider,[2004](https://arxiv.org/html/2605.20628#bib.bib46)\)concepts from the original abstract retained in the generated output\.CoverageandCompressionfrom Newsroom\(Gruskyet al\.,[2018](https://arxiv.org/html/2605.20628#bib.bib20)\)serve as complementary metrics to measure how source content is preserved and condensed\. For these two metrics, as well, smaller absolute deviations from the human\-written reference are preferred\.

Table 3:Ablation on document splitting strategies\. Splitting strategies are Naive Splitting \(NS\), Section Header Normalization \(SH\), and First Sentence Labeling \(FS\)\. \(Abbreviations as in Table[1](https://arxiv.org/html/2605.20628#S3.T1)and Table[2](https://arxiv.org/html/2605.20628#S3.T2)\)Table 4:Ablation on summarization prompt strategies\. Prompt variants are Basic Concise \(BC\), Detailed Instruction \(DI\), and Structural Instruction \(SI\)\. \(Abbreviations as in Table[1](https://arxiv.org/html/2605.20628#S3.T1)and Table[2](https://arxiv.org/html/2605.20628#S3.T2)\)

## 5Results

### 5\.1RQ1: Effectiveness of the Training\-Free Approach

We first evaluate whether the proposed training\-free approach can achieve competitive performance on the BAG task\. As shown in Table[1](https://arxiv.org/html/2605.20628#S3.T1)and Table[2](https://arxiv.org/html/2605.20628#S3.T2), DPR\-BAG \(BC prompt, no entity guidance extension\) achieves competitive performance against fine\-tuned baselines\. It generates abstracts that more closely resemble human\-written abstracts \(more abstractive\), while maintaining factual consistency with the full text \(Table[1](https://arxiv.org/html/2605.20628#S3.T1)\)\.

DPR\-BAG outperforms the fine\-tuned baselines in both abstractiveness and factuality \(all paired bootstrapp<0\.001p<0\.001on Bigram and Trigram novelty, Density, AlignScore, and MiniCheck; Appendix[J\.2](https://arxiv.org/html/2605.20628#A10.SS2)\), while the fine\-tuned baselines themselves perform better than other baselines in abstractiveness but their generations are less factual\.

DPR\-BAG yields mostly lower scores for semantic alignment and other supporting metrics compared to the baselines, reflecting a known bias in these metrics toward extractive outputs \(Table[2](https://arxiv.org/html/2605.20628#S3.T2)\)\. Baseline models exhibit high density and low novelty relative to human\-written abstracts \(Table[1](https://arxiv.org/html/2605.20628#S3.T1)\), indicating extractive behavior that likely inflates their scores\. At the same time, DPR\-BAG yields the highest compression rate \(50\.037\), potentially leading to over\-simplification of key information\. We provide qualitative examples of generated abstracts in Appendix[K](https://arxiv.org/html/2605.20628#A11)\.

### 5\.2RQ2: Impact of Structure\-Aware Decomposition

We next evaluate whether structure\-aware decomposition improves generation quality compared to naive prompting\. As shown in Table[3](https://arxiv.org/html/2605.20628#S4.T3), FS achieves the best overall performance\. Compared to NS, FS achieves significantly higher AlignScore \(paired diff = \+0\.006,p<0\.05p<0\.05\), with no significant difference on MiniCheck or SummaC\. Both approaches have similar abstractiveness, with NS achieving scores marginally closer to the human\-written abstracts on Trigram novelty\. Comparing FS with SH further underscores the necessity of fine\-grained local context: SH significantly degrades AlignScore \(\-0\.126\), MiniCheck \(\-0\.099\), and SummaC \(\-0\.141\), allp<0\.001p<0\.001, indicating that broad semantic boundaries provided by section headers fail to provide sufficient contextual anchoring for faithful generation\.

AbstractivenessFactualitySemantic AlignmentConfigTri\-gDens\.ASMCBSFOCUS↓\\downarrowU\-RBC0\.605 \(\-0\.058\)4\.690 \(\-1\.960\)0\.7620\.8900\.6170\.8280\.266DI0\.725 \(\+0\.063\)3\.131 \(\-3\.519\)0\.6420\.8060\.6380\.8850\.247DI\+Top\-50\.725 \(\+0\.062\)3\.138 \(\-3\.512\)0\.6440\.8070\.6370\.9330\.245DI\+Top\-100\.677 \(\+0\.014\)2\.938 \(\-3\.712\)0\.6440\.8110\.5950\.9150\.231Table 5:Ablation on UMLS entity guidance variants\. Top\-nndenotes the number of top\-ranked UMLS entities injected into the DI prompt\. \(Abbreviations as in Table[1](https://arxiv.org/html/2605.20628#S3.T1)and Table[2](https://arxiv.org/html/2605.20628#S3.T2)\)AbstractivenessFactualitySemantic AlignmentConfigTri\-gDens\.ASMCBSFOCUS↓\\downarrowU\-RBC0\.605 \(\-0\.058\)4\.690 \(\-1\.960\)0\.7620\.8900\.6170\.8280\.266SI0\.736 \(\+0\.073\)3\.049 \(\-3\.601\)0\.6300\.8050\.6360\.8360\.251SI \+ CoT0\.749 \(\+0\.086\)2\.940 \(\-3\.710\)0\.5960\.7920\.6310\.7940\.249Table 6:Ablation on Chain\-of\-Thought guidance for the SI prompt\. \(Abbreviations as in Table[1](https://arxiv.org/html/2605.20628#S3.T1)and Table[2](https://arxiv.org/html/2605.20628#S3.T2)\)
### 5\.3RQ3: Impact of Prompting Strategy and Entity Guidance

Next, we examine the effect of increasing prompt complexity\.

##### Prompting Strategy:

While DI and SI yield marginal but significant improvements in BERTScore \(p<0\.001p<0\.001\), both suffer from degradation in AlignScore, MiniCheck, SummaC, and UMLS Recall compared to BC \(allp<0\.001p<0\.001; Table[4](https://arxiv.org/html/2605.20628#S4.T4)\)\. This suggests that dense, multi\-faceted instructions might have introduced a distraction effect, where the model struggles to simultaneously satisfy formatting constraints and maintain source\-grounded factual alignment, motivating the need for explicit reasoning pathways and external grounding\.

##### Entity Guidance:

To assess whether external grounding can recover the factual consistency degradation observed in DI and SI, we ablate two entity guidance approaches: TR\-UMLS integrated into DI, and CoT integrated into SI, with BC \(no entity guidance\) serving as the reference baseline\.

As shown in Table[5](https://arxiv.org/html/2605.20628#S5.T5), applying TR\-UMLS to DI does not improve the overall performance\. Top\-5 and top\-10 variants show no significant change in AlignScore relative to the base DI prompt, and both remain significantly below the BC baseline \(p<0\.001p<0\.001\)\. DS\-Focus score also increases under top\-5 entities \(p<0\.01p<0\.01\), suggesting that explicit entity conditioning might cause the model to over\-prioritize the provided terms, exacerbating the distraction effect rather than serving as an effective grounding mechanism\.

Table[6](https://arxiv.org/html/2605.20628#S5.T6)shows that integrating CoT into the SI prompt improves semantic alignment with the original abstract compared to BC \(BERTScore \+0\.015,p<0\.001p<0\.001\) while yielding overall higher abstractiveness and lower factuality \(AlignScore \-0\.166, MiniCheck \-0\.098, SummaC \-0\.158, allp<0\.001p<0\.001\)\. This shows that CoT encourages more abstractive and novel phrasing, at the expense of reducing the factual overlap with the source\.

## 6Discussion

Our results show that the proposed training\-free approach generates abstracts that more closely resemble human\-written abstracts in terms of abstractiveness and factuality\. Notably, models fine\-tuned on PMC\-MAD do not match this performance on these dimensions\. However, these fine\-tuned models achieve slightly better semantic alignment, lexical/semantic overlap, and compression than DPR\-BAG\. Since the baseline models do not explicitly model rhetorical structure, we attribute the stronger performance of DPR\-BAG on abstractiveness and factuality to its structure\-aware design\. Specifically, the instruction\-tuned LLM backbone provides a baseline tendency toward natural, paraphrased output over verbatim copying\. This tendency is amplified by the decompose\-then\-refine design: partitioning full\-text articles into facet\-specific sub\-documents allows each summarization step to operate over shorter, topically coherent contexts, reducing verbatim copying under long\-context pressure\. The subsequent refinement stage then re\-synthesizes the concatenated facet summaries, further encouraging paraphrasing over fragment assembly\.

The distraction effect observed under DI and SI prompts indicates that instruction complexity can actively harm factual grounding in small LLMs\. Similarly, entity guidance via UMLS only seemed to increase instruction complexity, yielding little positive effect\. The inconsistent gains of SI\+CoT further suggest that publication\-type distribution can influence generation behavior\. This motivates publication\-type\-aware prompting as a future direction\. Additional BC\-prompt ablations corroborate these findings for TR\-UMLS \(Appendix[H](https://arxiv.org/html/2605.20628#A8)\)\.

To evaluate the generalizability of DPR\-BAG, we applied it to the PubMed Summarization datasetCohanet al\.\([2018](https://arxiv.org/html/2605.20628#bib.bib23)\)\. Unlike PMC\-MAD, this dataset is not stratified to reflect the publication type distribution of abstract\-less PubMed articles\. Overall, the trends observed on PMC\-MAD, particularly with respect to semantic alignment, abstractiveness, and factuality, largely persist on this dataset\. Some differences are also observed \(lower MiniCheck scores and better performance of SI\+CoT relative to BC\)\. Despite these variations, the results indicate that DPR\-BAG maintains robust performance across datasets, supporting its generalizability \(Appendix[F](https://arxiv.org/html/2605.20628#A6)\)\.

To investigate whether larger model sizes can improve generation quality, we also evaluatedQwen2\.5:7BandQwen2\.5:14Bunder the BC prompt\. These models achieve improvements in semantic alignment and compression rates; however, they also yield excessive abstractiveness and lower factuality, overall yielding little advantage over the base 3B model \(Appendix[G](https://arxiv.org/html/2605.20628#A7)\)\.

### 6\.1Limitations

Several limitations apply\. First, our evaluation relies primarily on automated metrics, which vary in their robustness\. Human evaluation is needed for complementary validation\. Second, the BOMRC schema may be suboptimal for articles with non\-standard discourse structure\. More adaptive decomposition strategies remain an open direction\. Finally, DPR\-BAG’s summaries exhibit substantially higher compression than human\-written abstracts, indicating considerably terser outputs\. This excessive compression risks omitting auxiliary but informative content such as background, secondary findings, or caveats\. Calibrating facet\-level length targets is a natural direction for future work\.

## 7Related Work

For the BAG task,Chachraet al\.\([2016](https://arxiv.org/html/2605.20628#bib.bib37)\)integrated domain\-specific classifiers and entailment graphs for extractive sentence selection, inheriting the disjointed flow typical of pure extractive approaches\.Sybrandt and Safro \([2021](https://arxiv.org/html/2605.20628#bib.bib43)\)proposed CBAG, which generates abstracts conditioned on author\-provided MeSH keywords rather than full\-text articles, leaving long\-context challenges unaddressed\.

To generate abstractive summaries from full\-text articles, DANCERGidiotis and Tsoumakas \([2020](https://arxiv.org/html/2605.20628#bib.bib15)\)and IDCUOTShen and Lam \([2022](https://arxiv.org/html/2605.20628#bib.bib16)\)pioneered the strategy of breaking long documents into manageable sections to bypass context window limitations\. However, these supervised approaches rely on heuristic alignment algorithms to map abstract sentences back to source sections and generate summaries for each section in isolation, making them prone to error propagation and fragmented coherence across sections\.

GenCompareSumBishopet al\.\([2022](https://arxiv.org/html/2605.20628#bib.bib31)\)leverages similar divide\-and conquer principles but remains extractive\-heavy: abstractive fragments serve only as anchors for sentence selection, resulting in discontinuous summaries that lack the narrative transitions typical of human\-written abstracts\.

## 8Conclusion

We presented DPR\-BAG, a training\-free, rhetorical structure\-aware, divide\-and\-conquer framework for biomedical abstract generation\. The method decomposes full\-text articles into semantic facets and applies parallel LLM\-based summarization followed by refinement\. Across both the PMC\-MAD and PubMed Summarization datasets, DPR\-BAG produces abstracts that most closely match human\-written abstracts in terms of abstractiveness and factual consistency, without task\-specific training and within standard hardware constraints\. The framework can be integrated as a preprocessing component into pipelines that require biomedical abstracts, thereby improving downstream performance\.

## References

- Longformer: the long\-document transformer\.External Links:2004\.05150,[Link](https://arxiv.org/abs/2004.05150)Cited by:[§4\.1](https://arxiv.org/html/2605.20628#S4.SS1.p1.1)\.
- J\. Bishop, Q\. Xie, and S\. Ananiadou \(2022\)GenCompareSum: a hybrid unsupervised summarization method using salience\.InProceedings of the 21st Workshop on Biomedical Language Processing,D\. Demner\-Fushman, K\. B\. Cohen, S\. Ananiadou, and J\. Tsujii \(Eds\.\),Dublin, Ireland,pp\. 220–240\.External Links:[Link](https://aclanthology.org/2022.bionlp-1.22/),[Document](https://dx.doi.org/10.18653/v1/2022.bionlp-1.22)Cited by:[§7](https://arxiv.org/html/2605.20628#S7.p3.1)\.
- O\. Bodenreider \(2004\)The Unified Medical Language System \(UMLS\): integrating biomedical terminology\.Nucleic Acids Research32\(Database issue\),pp\. D267–D270\.External Links:[Document](https://dx.doi.org/10.1093/nar/gkh061)Cited by:[§3\.3\.2](https://arxiv.org/html/2605.20628#S3.SS3.SSS2.Px1.p1.2),[§4\.2](https://arxiv.org/html/2605.20628#S4.SS2.SSS0.Px4.p1.1)\.
- S\. Chachra, A\. Ben Abacha, S\. Shooshan, L\. Rodriguez, and D\. Demner\-Fushman \(2016\)A hybrid approach to generation of missing abstracts in biomedical literature\.InProceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers,Y\. Matsumoto and R\. Prasad \(Eds\.\),Osaka, Japan,pp\. 1093–1100\.External Links:[Link](https://aclanthology.org/C16-1104/)Cited by:[§1](https://arxiv.org/html/2605.20628#S1.p3.1),[§7](https://arxiv.org/html/2605.20628#S7.p1.1)\.
- D\. Chernyshev and B\. Dobrov \(2024\)Investigating the pre\-training bias in low\-resource abstractive summarization\.IEEE Access12\(\),pp\. 47219–47230\.External Links:[Document](https://dx.doi.org/10.1109/ACCESS.2024.3379139)Cited by:[§1](https://arxiv.org/html/2605.20628#S1.p3.1)\.
- A\. Cohan, I\. Beltagy, D\. King, B\. Dalvi, and D\. Weld \(2019\)Pretrained language models for sequential sentence classification\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\),K\. Inui, J\. Jiang, V\. Ng, and X\. Wan \(Eds\.\),Hong Kong, China,pp\. 3693–3699\.External Links:[Link](https://aclanthology.org/D19-1383/),[Document](https://dx.doi.org/10.18653/v1/D19-1383)Cited by:[Appendix E](https://arxiv.org/html/2605.20628#A5.p1.1)\.
- A\. Cohan, F\. Dernoncourt, D\. S\. Kim, T\. Bui, S\. Kim, W\. Chang, and N\. Goharian \(2018\)A discourse\-aware attention model for abstractive summarization of long documents\.InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 \(Short Papers\),New Orleans, Louisiana,pp\. 615–621\.External Links:[Link](https://aclanthology.org/N18-2097),[Document](https://dx.doi.org/10.18653/v1/N18-2097)Cited by:[Appendix F](https://arxiv.org/html/2605.20628#A6.p1.1),[§2](https://arxiv.org/html/2605.20628#S2.p1.1),[§6](https://arxiv.org/html/2605.20628#S6.p3.1)\.
- F\. Dernoncourt and J\. Y\. Lee \(2017\)PubMed 200k RCT: a dataset for sequential sentence classification in medical abstracts\.InProceedings of the Eighth International Joint Conference on Natural Language Processing \(Volume 2: Short Papers\),G\. Kondrak and T\. Watanabe \(Eds\.\),Taipei, Taiwan,pp\. 308–313\.External Links:[Link](https://aclanthology.org/I17-2052/)Cited by:[§1](https://arxiv.org/html/2605.20628#S1.p4.1)\.
- B\. Fan, W\. Fan, C\. Smith, and H\. “\. Garner \(2020\)Adverse drug event detection and extraction from open data: a deep learning approach\.Information Processing & Management57\(1\),pp\. 102131\.External Links:ISSN 0306\-4573,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.ipm.2019.102131),[Link](https://www.sciencedirect.com/science/article/pii/S0306457319308623)Cited by:[§1](https://arxiv.org/html/2605.20628#S1.p2.1)\.
- N\. Giarelis, C\. Mastrokostas, and N\. Karacapilidis \(2023\)Abstractive vs\. extractive summarization: an experimental review\.Applied Sciences13\(13\)\.External Links:[Link](https://www.mdpi.com/2076-3417/13/13/7620),ISSN 2076\-3417,[Document](https://dx.doi.org/10.3390/app13137620)Cited by:[§1](https://arxiv.org/html/2605.20628#S1.p3.1)\.
- A\. Gidiotis and G\. Tsoumakas \(2020\)A divide\-and\-conquer approach to the summarization of long documents\.IEEE/ACM Transactions on Audio, Speech, and Language Processing28\(\),pp\. 3029–3040\.External Links:[Document](https://dx.doi.org/10.1109/TASLP.2020.3037401)Cited by:[§7](https://arxiv.org/html/2605.20628#S7.p2.1)\.
- M\. Grusky, M\. Naaman, and Y\. Artzi \(2018\)Newsroom: a dataset of 1\.3 million summaries with diverse extractive strategies\.InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long Papers\),M\. Walker, H\. Ji, and A\. Stent \(Eds\.\),New Orleans, Louisiana,pp\. 708–719\.External Links:[Link](https://aclanthology.org/N18-1065/),[Document](https://dx.doi.org/10.18653/v1/N18-1065)Cited by:[§I\.3](https://arxiv.org/html/2605.20628#A9.SS3.SSS0.Px1),[§I\.3](https://arxiv.org/html/2605.20628#A9.SS3.SSS0.Px1.p1.4),[§4\.2](https://arxiv.org/html/2605.20628#S4.SS2.SSS0.Px1.p1.1),[§4\.2](https://arxiv.org/html/2605.20628#S4.SS2.SSS0.Px4.p1.1)\.
- Y\. Gu, R\. Tinn, H\. Cheng, M\. Lucas, N\. Usuyama, X\. Liu, T\. Naumann, J\. Gao, and H\. Poon \(2021\)Domain\-specific language model pretraining for biomedical natural language processing\.ACM Trans\. Comput\. Healthcare3\(1\)\.External Links:[Link](https://doi.org/10.1145/3458754),[Document](https://dx.doi.org/10.1145/3458754)Cited by:[§1](https://arxiv.org/html/2605.20628#S1.p1.1)\.
- M\. Guo, J\. Ainslie, D\. Uthus, S\. Ontañón, J\. Ni, Y\. Sung, and Y\. Yang \(2022\)LongT5: Efficient text\-to\-text transformer for long sequences\.InFindings of the Association for Computational Linguistics: NAACL 2022,M\. Carpuat, M\. de Marneffe, and I\. V\. Meza Ruiz \(Eds\.\),Seattle, United States,pp\. 724–736\.External Links:[Link](https://aclanthology.org/2022.findings-naacl.55/),[Document](https://dx.doi.org/10.18653/v1/2022.findings-naacl.55)Cited by:[§4\.1](https://arxiv.org/html/2605.20628#S4.SS1.p1.1)\.
- H\. Gurulingappa, A\. M\. Rajput, A\. Roberts, J\. Fluck, M\. Hofmann\-Apitius, and L\. Toldo \(2012\)Development of a benchmark corpus to support the automatic extraction of drug\-related adverse effects from medical case reports\.Journal of biomedical informatics45\(5\),pp\. 885–892\.Cited by:[§1](https://arxiv.org/html/2605.20628#S1.p2.1)\.
- J\. He, M\. Rungta, D\. Koleczek, A\. Sekhon, F\. X\. Wang, and S\. Hasan \(2024\)Does prompt formatting have any impact on LLM performance?\.External Links:2411\.10541,[Link](https://arxiv.org/abs/2411.10541)Cited by:[§3\.3\.1](https://arxiv.org/html/2605.20628#S3.SS3.SSS1.Px3.p1.1)\.
- J\. P\. Ioannidis and M\. C\. Schippers \(2025\)In\-house editorials and journalistic pieces comprise a massive corpus in the scientific literature that can be improved\.European Journal of Clinical Investigation55\(8\),pp\. e70061\.Cited by:[§1](https://arxiv.org/html/2605.20628#S1.p2.1)\.
- Q\. Jin, B\. Dhingra, Z\. Liu, W\. Cohen, and X\. Lu \(2019\)PubMedQA: a dataset for biomedical research question answering\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\),pp\. 2567–2577\.Cited by:[§1](https://arxiv.org/html/2605.20628#S1.p1.1)\.
- P\. Laban, T\. Schnabel, P\. N\. Bennett, and M\. A\. Hearst \(2022\)SummaC: re\-visiting NLI\-based models for inconsistency detection in summarization\.Transactions of the Association for Computational Linguistics10,pp\. 163–177\.External Links:ISSN 2307\-387X,[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00453),[Link](https://doi.org/10.1162/tacl_a_00453),https://direct\.mit\.edu/tacl/article\-pdf/doi/10\.1162/tacl\_a\_00453/1987014/tacl\_a\_00453\.pdfCited by:[§I\.1](https://arxiv.org/html/2605.20628#A9.SS1.SSS0.Px3),[§4\.2](https://arxiv.org/html/2605.20628#S4.SS2.SSS0.Px2.p1.1)\.
- M\. Lan, L\. Zheng, S\. Ming, and H\. Kilicoglu \(2024\)Multi\-label sequential sentence classification via large language model\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 16086–16104\.External Links:[Link](https://aclanthology.org/2024.findings-emnlp.944/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.944)Cited by:[Appendix E](https://arxiv.org/html/2605.20628#A5.p1.1),[§3\.2](https://arxiv.org/html/2605.20628#S3.SS2.p1.1)\.
- C\. Lin \(2004\)ROUGE: a package for automatic evaluation of summaries\.InText Summarization Branches Out,Barcelona, Spain,pp\. 74–81\.External Links:[Link](https://aclanthology.org/W04-1013/)Cited by:[§4\.2](https://arxiv.org/html/2605.20628#S4.SS2.SSS0.Px4.p1.1)\.
- S\. Lin, J\. Menke, A\. Holt, H\. Kilicoglu, and N\. Smalheiser \(2025\)Section header normalization in biomedical articles using transformers\.InAMIA Annual Symposium Proceedings,Note:Poster P116External Links:[Link](https://amia.secure-platform.com/symposium/gallery/rounds/82021/details/19558)Cited by:[§3\.2](https://arxiv.org/html/2605.20628#S3.SS2.SSS0.Px2.p1.1)\.
- M\. Luo, A\. Mitra, T\. Gokhale, and C\. Baral \(2022\)Improving biomedical information retrieval with neural retrievers\.Proceedings of the AAAI Conference on Artificial Intelligence36\(10\),pp\. 11038–11046\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/21352),[Document](https://dx.doi.org/10.1609/aaai.v36i10.21352)Cited by:[§1](https://arxiv.org/html/2605.20628#S1.p1.1)\.
- A\. Magnet and D\. Carnet \(2006\)Letters to the editor: still vigorous after all these years?: a presentation of the discursive and linguistic features of the genre\.English for Specific Purposes25\(2\),pp\. 173–199\.Cited by:[§1](https://arxiv.org/html/2605.20628#S1.p2.1)\.
- J\. D\. Menke, H\. Kilicoglu, and N\. R\. Smalheiser \(2024\)Publication type tagging using transformer models and multi\-label classification\.AMIA Annual Symposium Proceedings2024,pp\. 818–827\.Cited by:[§2](https://arxiv.org/html/2605.20628#S2.p1.1)\.
- M\. Neumann, D\. King, I\. Beltagy, and W\. Ammar \(2019\)ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing\.InProceedings of the 18th BioNLP Workshop and Shared Task,Florence, Italy,pp\. 319–327\.External Links:[Link](https://www.aclweb.org/anthology/W19-5034),[Document](https://dx.doi.org/10.18653/v1/W19-5034),arXiv:1902\.07669Cited by:[§I\.3](https://arxiv.org/html/2605.20628#A9.SS3.p1.2),[§3\.3\.2](https://arxiv.org/html/2605.20628#S3.SS3.SSS2.Px1.p1.2)\.
- J\. L\. Nuzzo \(2021\)Letters to the editor in exercise science and physical therapy journals: an examination of content and “authorship inflation”\.Scientometrics126\(8\),pp\. 6917–6936\.External Links:[Document](https://dx.doi.org/10.1007/s11192-021-04068-w),[Link](https://doi.org/10.1007/s11192-021-04068-w),ISSN 1588\-2861Cited by:[§1](https://arxiv.org/html/2605.20628#S1.p2.1)\.
- A\. See, P\. J\. Liu, and C\. D\. Manning \(2017\)Get to the point: summarization with pointer\-generator networks\.InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),R\. Barzilay and M\. Kan \(Eds\.\),Vancouver, Canada,pp\. 1073–1083\.External Links:[Link](https://aclanthology.org/P17-1099/),[Document](https://dx.doi.org/10.18653/v1/P17-1099)Cited by:[§I\.3](https://arxiv.org/html/2605.20628#A9.SS3.SSS0.Px2.p1.1)\.
- X\. Shen and W\. Lam \(2022\)Improved divide\-and\-conquer approach to abstractive summarization of scientific papers\.In2022 4th International Conference on Natural Language Processing \(ICNLP\),Vol\.,pp\. 395–398\.External Links:[Document](https://dx.doi.org/10.1109/ICNLP55136.2022.00073)Cited by:[§7](https://arxiv.org/html/2605.20628#S7.p2.1)\.
- J\. Sybrandt and I\. Safro \(2021\)CBAG: conditional biomedical abstract generation\.PLoS One16\(7\),pp\. e0253905\.External Links:[Document](https://dx.doi.org/10.1371/journal.pone.0253905),[Link](https://doi.org/10.1371/journal.pone.0253905)Cited by:[§7](https://arxiv.org/html/2605.20628#S7.p1.1)\.
- L\. Tang, P\. Laban, and G\. Durrett \(2024\)MiniCheck: efficient fact\-checking of LLMs on grounding documents\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 8818–8847\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.499/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.499)Cited by:[§I\.1](https://arxiv.org/html/2605.20628#A9.SS1.SSS0.Px2),[§4\.2](https://arxiv.org/html/2605.20628#S4.SS2.SSS0.Px2.p1.1)\.
- A\. Ueda, R\. L\. T\. Santos, C\. Macdonald, and I\. Ounis \(2021\)Structured fine\-tuning of contextual embeddings for effective biomedical retrieval\.InProceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval,SIGIR ’21,New York, NY, USA,pp\. 2031–2035\.External Links:ISBN 9781450380379,[Link](https://doi.org/10.1145/3404835.3463075),[Document](https://dx.doi.org/10.1145/3404835.3463075)Cited by:[§1](https://arxiv.org/html/2605.20628#S1.p1.1)\.
- C\. J\. F\. Waaijer, C\. A\. van Bochove, and N\. J\. van Eck \(2011\)On the map: nature and science editorials\.Scientometrics86\(1\),pp\. 99–112\.External Links:[Document](https://dx.doi.org/10.1007/s11192-010-0205-9),[Link](https://doi.org/10.1007/s11192-010-0205-9),ISSN 1588\-2861Cited by:[§1](https://arxiv.org/html/2605.20628#S1.p2.1)\.
- T\. Wang, X\. Chen, Q\. Zhu, T\. Guo, S\. Gao, Z\. Lu, X\. Gao, and X\. Zhang \(2025\)New paradigm for evaluating scholar summaries: a facet\-aware metric and a meta\-evaluation benchmark\.ACM Trans\. Inf\. Syst\.43\(4\)\.External Links:ISSN 1046\-8188,[Link](https://doi.org/10.1145/3733597),[Document](https://dx.doi.org/10.1145/3733597)Cited by:[§1](https://arxiv.org/html/2605.20628#S1.p3.1)\.
- T\. C\. Wiegers, A\. P\. Davis, J\. Wiegers, D\. Sciaky, F\. Barkalow, B\. Wyatt, M\. Strong, R\. McMorran, S\. Abrar, and C\. J\. Mattingly \(2025\)Integrating AI\-powered text mining from PubTator into the manual curation workflow at the Comparative Toxicogenomics Database\.Database2025,pp\. baaf013\.External Links:ISSN 1758\-0463,[Document](https://dx.doi.org/10.1093/database/baaf013),[Link](https://doi.org/10.1093/database/baaf013),https://academic\.oup\.com/database/article\-pdf/doi/10\.1093/database/baaf013/62047329/baaf013\.pdfCited by:[§1](https://arxiv.org/html/2605.20628#S1.p1.1)\.
- Y\. Zha, Y\. Yang, R\. Li, and Z\. Hu \(2023\)AlignScore: evaluating factual consistency with a unified alignment function\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 11328–11348\.External Links:[Link](https://aclanthology.org/2023.acl-long.634/),[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.634)Cited by:[§I\.1](https://arxiv.org/html/2605.20628#A9.SS1.SSS0.Px1),[§4\.2](https://arxiv.org/html/2605.20628#S4.SS2.SSS0.Px2.p1.1)\.
- T\. Zhang, V\. Kishore, F\. Wu, K\. Q\. Weinberger, and Y\. Artzi \(2020\)BERTScore: evaluating text generation with BERT\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=SkeHuCVFDr)Cited by:[§4\.2](https://arxiv.org/html/2605.20628#S4.SS2.SSS0.Px3.p1.1)\.
- Y\. Zhang, D\. Cao, L\. Du, Q\. Fu, and Y\. Liu \(2025\)When splitting makes stronger: a theoretical and empirical analysis of divide\-and\-conquer prompting in LLMs\.InSecond Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=rAR7iPI8Kh)Cited by:[§1](https://arxiv.org/html/2605.20628#S1.p4.1)\.
- W\. Zhao, M\. Strube, and S\. Eger \(2023\)DiscoScore: evaluating text generation with BERT and discourse coherence\.InProceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics,A\. Vlachos and I\. Augenstein \(Eds\.\),Dubrovnik, Croatia,pp\. 3865–3883\.External Links:[Link](https://aclanthology.org/2023.eacl-main.278/),[Document](https://dx.doi.org/10.18653/v1/2023.eacl-main.278)Cited by:[§I\.2](https://arxiv.org/html/2605.20628#A9.SS2.p1.1),[§4\.2](https://arxiv.org/html/2605.20628#S4.SS2.SSS0.Px3.p1.1)\.

## Appendix ASummarization Prompt Templates

All prompting strategies share afacet\_guidelinesdictionary that maps each facet label to a facet\-specific instruction\. We use two versions: a concise version \(Table[7](https://arxiv.org/html/2605.20628#A1.T7)\) used in BC and SI prompts, and a detailed version adapted from JMIR author guidelines \(see footnote[1](https://arxiv.org/html/2605.20628#footnote1)\) \(Table[8](https://arxiv.org/html/2605.20628#A1.T8)used in DI prompts\)\. In all prompt templates,<facet\_text\>denotes the input facet text,<facet\_type\>denotes the rhetorical facet label, and<facet\_guide\>denotes the corresponding facet\-specific instruction\.

Table 7:Concise facet\-specific guidelines \(used in BC and SI prompts\)\.Table 8:Detailed facet\-specific guidelines \(used for DI prompts\), adapted from JMIR author guidelines\.### A\.1Basic Concise Prompt \(BC\)

System MessageYou are a biomedical summarization assistant\. 1\. Use a formal, objective, scientific tone\. 2\. Never use meta\-phrases like ’the authors state’ or ’this section describes’\. Respond ONLY with a JSON object: \{"summary": "…", "reasoning": "…"\}\. No markdown, no talk\.

User MessageSummarize this<facet\_type\>sectionSpecific focus:<facet\_guide\>Paragraph text:<facet\_text\>

### A\.2Detailed Instruction Prompt \(DI\)

System MessageYou are a biomedical synthesis assistant\. 1\. Use a formal, objective, scientific tone\.2\. Never use meta\-phrases like ‘the authors state’ or ‘this section describes’\.Define your output strictly as:\-‘summary’: The synthesized biomedical text\.\-‘reasoning’: A brief explanation\.Format: \{"summary": "…", "reasoning": "…"\}\. No markdown, no talk\.

User MessageSynthesize the critical information from this<facet\_type\>section provided in the ‘Paragraph text’<facet\_guide\>Paragraph text:<facet\_text\>

### A\.3Structural Instruction Prompt \(SI\)

System Message\# ROLEYou are an expert Biomedical Summarization Assistant\.\# STRICT GUIDELINES\- \*\*Tone\*\*: Formal, objective, and academic\.\- \*\*No Meta\-Talk\*\*: Do NOT use phrases like ‘The authors state’ or ‘This section describes’\.\- \*\*Output Format\*\*: Respond \*\*ONLY\*\* with a valid JSON object\. No Markdown blocks, no preamble, no postscript\.“‘json\{"reasoning": "Brief explanation of how your summary fulfills the given instructions…", "summary": "Final summary…"\} “‘

User Message\#\# TASK: SUMMARY GENERATION\*\*Target Focus:\*\*<facet\_guide\>\*\*Instructions:\*\* Generate a professional biomedical summary\.—\#\#\# INPUT TEXT \(Reference\)<facet\_text\>

### A\.4BC for Naive Splitting \(BC\-NS\)

When paired with the Naive Splitting baseline, the system message remains unchanged\. As no facet label is assigned, the user message uses a generic focus instruction:

User MessageSummarize this section\.Specific focus: Summarize the main topic\.Paragraph text:<facet\_text\>

### A\.5BC with TR\-UMLS Entity Guidance \(BC\+TR\-UMLS\)

System MessageYou are a biomedical summarization assistant\. 1\. Use a formal, objective, scientific tone\. 2\. Never use meta\-phrases like ’the authors state’ or ’this section describes’\. Respond ONLY with a JSON object: \{"summary": "…", "reasoning": "…"\}\. No markdown, no talk\.

User MessageSummarize this<facet\_type\>sectionSpecific focus:<facet\_guide\> Ensure the core meanings of these key biomedical entities are preserved or synthesized accurately:Paragraph text:<top\_entities\><facet\_text\>

### A\.6DI with TR\-UMLS Entity Guidance \(DI\+TR\-UMLS\)

User MessageSynthesize the critical information from this<facet\_type\>section provided in the ‘Paragraph text’<facet\_guide\>Ensure the core meanings of these key biomedical entities are preserved or synthesized accurately: <top\_entities\>Paragraph text:<facet\_text\>

### A\.7SI with Chain\-of\-Thought \(SI\+CoT\)

System Message\# ROLEYou are an expert Biomedical Summarization Assistant\.—\# OPERATIONAL FRAMEWORKYou must follow this 2\-Stage Chain\-of\-Thought process:1\. Stage 1: Element Extraction \(entities, parameters, methodologies, statistics\)\.2\. Stage 2: Summary Generation \(synthesize into scientific narrative\)\.—\# STRICT GUIDELINES\- \*\*Tone\*\*: Formal, objective, and academic\.\- \*\*No Meta\-Talk\*\*: Do NOT use phrases like ‘The authors state’ or ‘This section describes’\.\- \*\*Output Format\*\*: Respond \*\*ONLY\*\* with a valid JSON object\. No Markdown blocks, no preamble, no postscript\.“‘json\{"reasoning": "Stage 1 extraction results…", "summary": "Stage 2 final summary…"\}“‘

User Message 1 \(Stage 1\)\#\# TASK 1: ELEMENT EXTRACTION\*\*Section Type:\*\*<facet\_type\> \*\*Instructions:\*\* Extract key elements from the text below, including:\- \*\*Entities:\*\* Diseases, genes, drugs, proteins\.\- \*\*Parameters:\*\* Sample sizes, dosage, duration\.\- \*\*Methodology:\*\* Study design, assays, equipment\.\- \*\*Statistics:\*\* P\-values, confidence intervals, effect sizes\.—\#\#\# INPUT TEXT<facet\_text\>

User Message 2 \(Stage 2\)\#\# TASK 2: SUMMARY GENERATION\*\*Target Focus:\*\*<facet\_guide\>\*\*Instructions:\*\* Using the elements extracted in Task 1, generate a professional biomedical summary\.Ensure the summary is dense with information but remains readable and scientifically accurate\.—\#\#\# INPUT TEXT \(Reference\)<facet\_text\>

## Appendix BValidation and Fallback Details

When the fallback process is triggered, the original six facets are regrouped into three broader categories:Intro\(Background and Objective\),Main Idea\(Methods and Others\), andResults & Conclusions\(Results and Conclusions\)\. Background and Objective are merged as both establish the research context and motivation\. Results and Conclusions are paired as both convey findings and their implications\. The remaining Others facet, which retains paragraphs not assigned to any BOMRC category by the sentence classifier, is grouped with Methods by elimination, as the other four facets form more natural rhetorical pairs\. These regrouped facets are then sent back to the parallel summarization module\. If the LLM still fails to summarize a specific regrouped facet, the first 300 characters \(Lead\-300\) of that facet are used as the facet summary to capture core information via lead bias without adding granular noise\. The fallback mechanism was rarely triggered in practice, suggesting that this pairing has minimal impact on overall generation quality\. In a random sample of 300 test articles \(6\.5% of the test set\), the fallback mechanism was never triggered, suggesting with 95% confidence that fewer than 1% of articles require fallback intervention\.

## Appendix CRefinement Prompt

This prompt is used in the final stage to smooth the concatenated facets, ensuring structural coherence and stylistic consistency across the unified abstract\.<draft\_abstract\>denotes the place holder for the concatenated draft abstract\.

System MessageYou are a biomedical abstract refinement assistant\. Refine the abstract based on the abstract draft\.CRITICAL INSTRUCTION:Respond ONLY with a valid JSON object\.Do NOT use Markdown code blocks \(like “‘json\)\.Do NOT provide any conversational text\.Format:\{"abstract": "your abstract text here","reasoning": "your reasoning here"\}

User Messageabstract draft:<draft\_abstract\>

## Appendix DExtended Implementation Details and Token Distribution

All fine\-tuning and evaluation procedures were conducted on an NVIDIA Tesla V100 GPU \(32GB VRAM\)\. While the underlying LED architecture theoretically supports sequences up to 16,384 tokens, processing such lengths on standard hardware is computationally prohibitive, inevitably leading to out\-of\-memory \(OOM\) errors even with minimal batch sizes\. To fit within this memory budget, the fine\-tuned LED\-Pubmed used gradient accumulation with an effective batch size of 8, halted at 500 steps based on validation performance\.

This constraint restricts standard baselines to 8,192\-token inputs\. As illustrated in Figure[2](https://arxiv.org/html/2605.20628#A4.F2), our empirical analysis of the 46,309 articles in the dataset reveals a median of 2,959 tokens and an average length of approximately 6,018 tokens\. While the 8,192\-token capacity successfully accommodates 77\.67% of the dataset, the length distribution exhibits a severe long\-tail characteristic\. Specifically, 22\.33% of the articles exceed this limit, with the longest document reaching an extreme 1,185,139 tokens\. For these extensive studies, standard baselines operating within memory limits are forced to truncate critical information, such as discussion and conclusion sections, which underscores the necessity of the partitioned approach introduced in our DPR\-BAG framework\.

![Refer to caption](https://arxiv.org/html/2605.20628v1/distribution.png)Figure 2:Distribution of document token lengths in the dataset\. The red dashed line denotes the 8,192\-token hardware limit for standard baselines\.
## Appendix ELLM\-SSC

LLM\-SSCLanet al\.\([2024](https://arxiv.org/html/2605.20628#bib.bib18)\)is evaluated on the BIORC800 dataset, a manually annotated multi\-label SSC dataset of biomedical abstracts using the BOMRC schema\. Under task\-specific fine\-tuning, LLM\-SSC achieves a micro F1 of 0\.907 and macro F1 of 0\.912 on BIORC800, outperforming prior SSC baselinesCohanet al\.\([2019](https://arxiv.org/html/2605.20628#bib.bib50)\)\. We adopt LLM\-SSC for our document splitting module as its label schema \(Background, Objective, Methods, Results, Conclusions, and None\) directly aligns with the BOMRC\+ facets in DPR\-BAG, where the None label corresponds to our Others facet\.

## Appendix FPubMedSum Dataset Validation

To evaluate the generalizability of our pipeline, we also tested DPR\-BAG on the PubMed Summarization datasetCohanet al\.\([2018](https://arxiv.org/html/2605.20628#bib.bib23)\), hereafter PubMedSum\. As shown in Figure[4](https://arxiv.org/html/2605.20628#A6.F4), unlike PMC\-MAD, which is stratified to reflect the publication type distribution of abstract\-less PubMed manuscripts, the majority of articles in the PubMedSum could not be mapped to a specific publication type\. This allows us to benchmark the performance of DPR\-BAG against established standards in the biomedical domain and ensure that our findings are not limited to the specific characteristics of the PMC\-MAD corpus\.

![Refer to caption](https://arxiv.org/html/2605.20628v1/token_distribution.png)Figure 3:Token Distribution ComparisonAs illustrated in Figure[3](https://arxiv.org/html/2605.20628#A6.F3), the two datasets exhibit distinct token length profiles\. PMC\-MAD has a median document length of 2959 tokens \(IQR: 1603–7424\), while PubMedSum has a median of 3,166 tokens \(IQR: 1838–4952\)\.

![Refer to caption](https://arxiv.org/html/2605.20628v1/PT_distribution.png)Figure 4:Publication type distribution of PMC\-MAD, PubMed Summarization Dataset, and PubMed articles without abstracts\.Table 9:Performance comparison on the PubMed Summarization dataset across primary evaluation dimensions\. Conventions and abbreviations as in Table[1](https://arxiv.org/html/2605.20628#S3.T1)Semantic AlignmentSupportingModelBSSENTFOCUS↓\\downarrowR\-LU\-RCov\.Comp\.Original abstract–––––0\.87816\.084LED\-Arxiv \(base\)0\.6410\.9131\.3980\.2360\.3480\.968 \(\+0\.091\)17\.739 \(\+1\.655\)LED\-Pubmed \(base\)0\.6640\.9291\.1000\.2720\.4150\.960 \(\+0\.082\)14\.235 \(\-1\.849\)LongT5 \(base\)0\.6640\.8941\.5850\.2650\.3450\.962 \(\+0\.085\)25\.643 \(\+9\.560\)DPR\-BAG \(BC\)0\.6510\.8142\.1160\.1920\.2170\.876 \(\-0\.002\)43\.962 \(\+27\.878\)DPR\-BAG \(SI\+CoT\)0\.6520\.8391\.9270\.1970\.2450\.878 \(\+0\.000\)40\.068 \(\+23\.984\)Table 10:Performance comparison on the PubMed Summarization dataset across semantic alignment and supporting metrics\. Conventions and abbreviations as in Table[2](https://arxiv.org/html/2605.20628#S3.T2)Table[9](https://arxiv.org/html/2605.20628#A6.T9)and Table[10](https://arxiv.org/html/2605.20628#A6.T10)present results on the PubMed Summarization dataset for the two best PMC\-MAD configurations: BC \(without entity guidance\) and SI\+CoT\. Consistent with PMC\-MAD findings, DPR\-BAG underperforms all baseline models on ROUGE, UMLS Recall, DiscoScore, and SummaC, while achieving better coverage, density, and novel n\-gram scores closer to the original human\-written abstract\. Both configurations outperform all baselines on AlignScore, confirming that the abstractiveness gains do not compromise factuality\.

However, DPR\-BAG’s MiniCheck advantage on PMC\-MAD does not transfer here, attributable to PubMedSum’s inherent higher abstractiveness \(lower density, higher novelty\), which penalizes NLI\-based metrics \(MiniCheck\) even when factual content is preserved — as evidenced by DPR\-BAG’s consistently superior AlignScore across both datasets\.

Unlike on PMC\-MAD where SI\+CoT only improved BERTScore, on PubMedSum SI\+CoT significantly improves AlignScore, ROUGE\-L, UMLS Recall, DiscoScore, and Compression while showing no significant degradation on factuality \(MiniCheck, SummaC\) or abstractiveness\. This suggests that CoT’s benefit varies with document characteristics, potentially driven by publication\-type distribution differences\.

Table 11:Scale\-up comparison on primary evaluation dimensions\. All configurations use the BC prompt without entity guidance\. Conventions and abbreviations as in Table[1](https://arxiv.org/html/2605.20628#S3.T1)Table 12:Scale\-up comparison on semantic alignment and supporting metrics\. Conventions and abbreviations as in Table[2](https://arxiv.org/html/2605.20628#S3.T2)
## Appendix GEffect of Backbone Model Size

To investigate whether scaling up the backbone LLM improves generation quality, we evaluatedQwen2\.5:7BandQwen2\.5:14Bunder the BC prompt without entity guidance\. As the Llama3\.2 series does not offer a model beyond 3B, we use Qwen 2\.5 for the scaling analysis\.

As shown in Table[11](https://arxiv.org/html/2605.20628#A6.T11)and[12](https://arxiv.org/html/2605.20628#A6.T12), Qwen2\.5:14B improves BERTScore, UMLS Recall, and DS\-Focus relative to the 3B backbone, recording the best DS\-Focus, and UMLS Recall among all DPR configurations, alongside a compression rate most closely aligned with the original abstracts\. However, both larger backbones exhibit excessively higher abstractiveness \(lower Density, higher Novel n\-grams\) and declining factuality \(SummaC, AlignScore and MiniCheck\) relative to the 3B backbone, suggesting that larger models promote more abstractive phrasing and precise semantic alignment of key noun foci at the cost of reduced factual overlap with the source\.

## Appendix HBC Entity Guidance Ablations

To isolate the effect of entity guidance from prompt complexity, we additionally apply TR\-UMLS to the BC prompt and compare against BC without entity guidance\.

We integrate TR\-UMLS with top\-5 UMLS entities into the BC prompt \(BC\+Top\-5\)\. As shown in Table[13](https://arxiv.org/html/2605.20628#A8.T13), BC\+Top\-5 significantly degrades factuality relative to BC \(p<0\.001p<0\.001\) and decreases UMLS Recall \(p<0\.001p<0\.001\), indicating that the injected UMLS terms fail to anchor concept reproduction in the output\. These results confirm that TR\-UMLS entity guidance remains ineffective in this zero\-shot setting regardless of the underlying prompt strategy\. Significance details are reported in Appendix[J\.5](https://arxiv.org/html/2605.20628#A10.SS5)\.

Table 13:Ablation of TR\-UMLS entity guidance applied to the BC prompt\. \(Abbreviations as in Table[1](https://arxiv.org/html/2605.20628#S3.T1)and Table[2](https://arxiv.org/html/2605.20628#S3.T2)\)
## Appendix IEvaluation Metric Details

### I\.1Factual Consistency \(AlignScore, MiniCheck and SummaC\)

##### AlignScoreZhaet al\.\([2023](https://arxiv.org/html/2605.20628#bib.bib41)\):

AlignScore trains a unified alignment model on 4\.7M examples spanning 7 tasks—NLI, fact verification, paraphrase, semantic textual similarity, question answering, information retrieval, and summarization—producing a single factual consistency score\. At inference time, the source document is split into overlapping chunks of approximately 350 tokens, and each sentence in the generated abstract is evaluated against all chunks; the highest alignment score per sentence is averaged to yield the final score\. We use the base variant \(RoBERTa\-base, 125M parameters\)\.

##### MiniCheckTanget al\.\([2024](https://arxiv.org/html/2605.20628#bib.bib42)\):

MiniCheck trains a small fact\-checking model on synthetically constructed data generated by GPT\-4, where each instance is designed to require verifying multiple atomic facts against multi\-sentence evidence\. It produces a binary supported/unsupported prediction per sentence\. In our evaluation, each sentence in the generated abstract is treated as an individual claim and verified against the source full\-text document\. We use the Flan\-T5\-Large variant \(770M parameters\)\.

##### SummaCLabanet al\.\([2022](https://arxiv.org/html/2605.20628#bib.bib40)\):

SummaC segments the source document into individual sentences and scores each generated sentence by aggregating sentence\-level NLI entailment probabilities against all source sentences\. The SummaCConv\{\}\_\{\\text\{Conv\}\}variant learns a convolutional layer over the full distribution of these entailment scores, rather than relying only on the maximum, making it more robust to outliers\. We use theSummaCConvimplementation with a VitaminC\-trained NLI backbone and sentence\-level granularity\.

### I\.2DiscoScore \(DS\_SENT\_NN & DS\_FOCUS\_NN\)

DiscoScoreZhaoet al\.\([2023](https://arxiv.org/html/2605.20628#bib.bib19)\)evaluates discourse coherence by modeling focus transitions across sentences\. In this work, we use nouns \(NN\) as the focus, one of several focus choices supported by the metric, to compare the discourse coherence between the reference and generated abstracts\.

##### DS\_SENT\_NN:

Constructs a sentence graph where edges are drawn between any two sentences \(not just adjacent ones\) that share at least one noun focus\. Edge weights are inversely proportional to the distance between the two sentences \(1/\(j−i\)1/\(j\-i\)\), capturing both local and long\-range coherence\. Sentence embeddings are then aggregated according to this graph structure, and the cosine similarity between the resulting graph\-level embeddings of the generated and reference texts is used as the final score\.

##### DS\_FOCUS\_NN:

Measures how closely the frequency and semantics of shared noun foci match between the generated and reference texts\. For each focus shared by both texts, it computes the distance between their embeddings \(derived by summing the contextualized token embeddings of all associated tokens\), and averages these distances as the final score\.

### I\.3UMLS Recall

This metric measures how well the generated abstract preserves biomedical concepts present in the reference abstract\. We extract unique UMLS concepts from both the original abstract \(CrefC\_\{\\text\{ref\}\}\) and the generated abstract \(CgenC\_\{\\text\{gen\}\}\) using a biomedical entity linker \(scispaCy,Neumannet al\.,[2019](https://arxiv.org/html/2605.20628#bib.bib45)\) with the UMLS knowledge base\)\. UMLS Recall is then computed as:

UMLS Recall=\|Cgen∩Cref\|\|Cref\|\\text\{UMLS\\ Recall\}=\\frac\{\|C\_\{\\text\{gen\}\}\\cap C\_\{\\text\{ref\}\}\|\}\{\|C\_\{\\text\{ref\}\}\|\}\(1\)
##### NewsroomGruskyet al\.\([2018](https://arxiv.org/html/2605.20628#bib.bib20)\): Coverage, Density, and Compression:

We adopt the extractive fragment measures fromGruskyet al\.\([2018](https://arxiv.org/html/2605.20628#bib.bib20)\)\. Given a summarySSand source documentDD, extractive fragments are the set of longest common substrings shared betweenSSandDD\.Coveragemeasures the proportion of summary tokens belonging to an extractive fragment\.Densitymeasures the average length of the extractive fragments, reflecting how verbatim the summary is\.Compressionis the ratio of source document length to summary length\.

##### N\-gram Novelty

FollowingSeeet al\.\([2017](https://arxiv.org/html/2605.20628#bib.bib21)\), we compute the proportion of bigrams and trigrams in the generated abstract that do not appear in the source full\-text document\. Higher novelty indicates greater abstractive synthesis relative to the source\.

## Appendix JStatistical Significance Analysis

### J\.1Method

For each comparison between two configurationsAAandBB, we align per\-document scores by article ID and resample with replacement over 10,000 iterations\. For metrics where higher \(or lower\) scores indicate better performance \(AlignScore, MiniCheck, SummaC, BERTScore, DiscoScore variants, ROUGE\-L, UMLS Recall\), we testH0:𝔼​\[sA\]=𝔼​\[sB\]H\_\{0\}\\\!:\\mathbb\{E\}\[s\_\{A\}\]=\\mathbb\{E\}\[s\_\{B\}\]on raw scores\. For abstractiveness metrics where the goal is proximity to the reference distribution \(Bigram novelty, Trigram novelty, Density, Coverage, Compression\), we instead testH0:𝔼​\[\|sA−sref\|\]=𝔼​\[\|sB−sref\|\]H\_\{0\}\\\!:\\mathbb\{E\}\[\\,\|s\_\{A\}\-s\_\{\\text\{ref\}\}\|\\,\]=\\mathbb\{E\}\[\\,\|s\_\{B\}\-s\_\{\\text\{ref\}\}\|\\,\], wheresrefs\_\{\\text\{ref\}\}is the corresponding score of the human\-written abstract on the same document\. Under this formulation, a negative difference indicates thatAAis closer to the reference thanBB\. All tests are two\-sided, with significance markers∗\(p<0\.05p<0\.05\),∗∗\(p<0\.01p<0\.01\), and∗∗∗\(p<0\.001p<0\.001\)\. Comparisons not marked are not significant atp<0\.05p<0\.05\. The per\-document scores used for testing are extracted from the same evaluation pipeline that produces the average numbers in Section[5](https://arxiv.org/html/2605.20628#S5); only articles with valid scores from both configurations are included in each pairwise test\.

In all tables below, mean differences are computed asA−BA\\,\-\\,B\. Arrows next to metric names indicate whether higher \(↑\\uparrow\) or lower \(↓\\downarrow\) values are preferred\.

### J\.2DPR\-BAG vs\. Baselines \(PMC\-MAD\)

Table[14](https://arxiv.org/html/2605.20628#A10.T14), Table[15](https://arxiv.org/html/2605.20628#A10.T15), and Table[16](https://arxiv.org/html/2605.20628#A10.T16)report significance for the comparisons between DPR\-BAG \(BC\) and the baseline models on PMC\-MAD\.

Table 14:Mean differences \(DPR\-BAG minus baseline\) between DPR\-BAG \(BC\) and the pretrained LED baselines on PMC\-MAD\.Table 15:Mean differences \(DPR\-BAG minus baseline\) between DPR\-BAG \(BC\) and the fine\-tuned LED\-PubMed and LongT5 baselines\.Table 16:Mean differences \(DPR\-BAG minus baseline\) between DPR\-BAG \(BC\) and the fine\-tuned LongT5 baseline\.
### J\.3Splitting Strategies

Table[17](https://arxiv.org/html/2605.20628#A10.T17)reports significance for the comparisons between FS and other splitting strategies\.

Table 17:Mean differences between FS and other splitting strategies\.
### J\.4Prompting Strategies

Table[18](https://arxiv.org/html/2605.20628#A10.T18)reports significance for the prompting strategy ablation\.

Table 18:Mean differences across prompting strategies\.
### J\.5TR\-UMLS Entity Guidance

Table[19](https://arxiv.org/html/2605.20628#A10.T19)and Table[20](https://arxiv.org/html/2605.20628#A10.T20)report significance for the TR\-UMLS entity guidance ablation\.

Table 19:Mean differences between DI and TR\-UMLS\-augmented DI\.Table 20:Mean differences between BC and TR\-UMLS\-augmented variants\.
### J\.6Chain\-of\-Thought Guidance

Table[21](https://arxiv.org/html/2605.20628#A10.T21)reports significance for the SI\+CoT ablation\.

Table 21:Mean differences for the SI\+CoT ablation\.
### J\.7PubMedSum Generalization

Table[22](https://arxiv.org/html/2605.20628#A10.T22)reports significance for DPR\-BAG \(BC\) against the baseline models on PubMedSum, and Table[23](https://arxiv.org/html/2605.20628#A10.T23)compares BC against SI\+CoT on the same dataset\.

Table 22:Mean differences \(DPR\-BAG minus baseline\) between DPR\-BAG \(BC\) and baseline models on PubMedSum\.Table 23:Mean differences between DPR\-BAG \(BC\) and DPR\-BAG \(SI\+CoT\) on PubMedSum\.
### J\.8Backbone Scale

Table[24](https://arxiv.org/html/2605.20628#A10.T24)reports significance for the backbone scaling experiments\.

Table 24:Mean differences \(Llama\-3\.2:3B minus Qwen\) between Llama\-3\.2:3B and the scaled Qwen2\.5 backbones \(denoted Qwen\-7B and Qwen\-14B\)\.

## Appendix KCase Study: Qualitative Comparison of Generated Abstracts

As shown in Table[25](https://arxiv.org/html/2605.20628#A11.T25), both LED and LongT5 models exhibit severe verbatim copying from the source full text, with LongT5 further producing a lexical hallucination\. In contrast, DPR\-BAG demonstrates stronger topic identification and more abstractive generation for the BAG task, though it still occasionally exhibits entity relation confusion \(e\.g\., misattributing target genes as lncRNAs\)\.

Table 25:Qualitative comparison of generated abstracts for a sample article \(PMCID: PMC6625196\) from the PMC\-MAD test set\.Orange: verbatim copy from source full text\.Red: factual error\.

Similar Articles

Disco-RAG: Discourse-Aware Retrieval-Augmented Generation

arXiv cs.CL

Disco-RAG proposes a discourse-aware retrieval-augmented generation framework that integrates discourse signals through intra-chunk discourse trees and inter-chunk rhetorical graphs to improve knowledge synthesis in LLMs. The method achieves state-of-the-art results on QA and summarization benchmarks without fine-tuning.

Injecting Structured Biomedical Knowledge into Language Models: Continual Pretraining vs. GraphRAG

arXiv cs.CL

This paper compares two strategies for injecting structured biomedical knowledge from the UMLS Metathesaurus into language models: continual pretraining (embedding knowledge into model parameters) and GraphRAG (querying a knowledge graph at inference time). Results show improvements on biomedical QA benchmarks, with GraphRAG on LLaMA 3-8B yielding over 3 and 5 accuracy points on PubMedQA and BioASQ respectively without any retraining.