Few-Shot Biomedical Relation Extraction with Large Language Models: A Viable Alternative to Supervised Learning?

arXiv cs.CL Papers

Summary

This paper investigates few-shot biomedical relation extraction using prompt-based learning with LLMs, comparing pairwise classification and joint generation approaches. The best model achieves micro-F1 of 0.44, outperforming previous few-shot results but remaining below supervised baselines, while macro-F1 surpasses the supervised baseline on rare relation types.

arXiv:2606.15412v1 Announce Type: new Abstract: Biomedical relation extraction (BioRE) is a key step in transforming biomedical literature into structured knowledge. However, most existing approaches rely on supervised models trained on costly annotated datasets, limiting their scalability and adaptability across relation types and domains. We investigate few-shot BioRE using prompt-based learning with large language models (LLMs) and compare two task formulations: pairwise classification, which predicts relations for individual entity pairs, and joint generation, which extracts multiple relations in a single model call. Experiments on the BioREDirect dataset reveal a clear precision-recall trade-off. Pairwise classification achieves higher recall, whereas joint generation is more precise and computationally efficient. The best-performing model achieves a micro-F1 score of 0.44, substantially outperforming previous few-shot results (0.34) while remaining below the supervised baseline (0.56). Much of this gap is attributable to a single ambiguously defined relation type. When evaluated using macro-F1, which better captures performance across relation types in an imbalanced setting, prompt-based approaches outperform the supervised baseline (0.45 vs. 0.38), particularly on rare relation types. These findings highlight the potential of LLMs for BioRE in low-resource settings and underscore the importance of well-defined relation schemas.

Cached at: 06/16/26, 11:48 AM

# Few-Shot Biomedical Relation Extraction with Large Language Models: A Viable Alternative to Supervised Learning?
Source: [https://arxiv.org/html/2606.15412](https://arxiv.org/html/2606.15412)
University of Ljubljana, Ljubljana, Slovenia
Baylor College of Medicine, Houston, Texas, USA

###### Abstract

Biomedical relation extraction \(BioRE\) is a key step in transforming biomedical literature into structured knowledge\. However, most existing approaches rely on supervised models trained on costly annotated datasets, limiting their scalability and adaptability across relation types and domains\. We investigate few\-shot BioRE using prompt\-based learning with large language models \(LLMs\) and compare two task formulations: pairwise classification, which predicts relations for individual entity pairs, and joint generation, which extracts multiple relations in a single model call\. Experiments on the BioREDirect dataset reveal a clear precision–recall trade\-off\. Pairwise classification achieves higher recall, whereas joint generation is more precise and computationally efficient\. The best\-performing model achieves a micro\-F1 score of 0\.44, substantially outperforming previous few\-shot results \(0\.34\) while remaining below the supervised baseline \(0\.56\)\. Much of this gap is attributable to a single ambiguously defined relation type\. When evaluated using macro\-F1, which better captures performance across relation types in an imbalanced setting, prompt\-based approaches outperform the supervised baseline \(0\.45 vs\. 0\.38\), particularly on rare relation types\. These findings highlight the potential of LLMs for BioRE in low\-resource settings and underscore the importance of well\-defined relation schemas\.

## 1 Introduction

The volume of scientific publications has expanded rapidly in recent decades, leading to an unprecedented accumulation of written knowledge\. This trend is particularly pronounced in the biomedical field, where the PubMed database alone indexes more than 1\.5 million new articles annually\[[13](https://arxiv.org/html/2606.15412#bib.bib31)\], not including other sources of biomedical text such as clinical notes and electronic health records\. While this growth creates new opportunities for discovery, it also renders manual analysis and knowledge integration increasingly infeasible\.

Natural language processing techniques address this challenge by converting unstructured text into structured representations\. A knowledge graph \(KG\) is a widely adopted framework that encodes entities and their relationships in both human\- and machine\-readable form\[[9](https://arxiv.org/html/2606.15412#bib.bib13)\]\. Constructing KGs from text relies on two core information extraction tasks: named entity recognition \(NER\), which identifies entity mentions, and relation extraction \(RE\), which determines semantic relationships between entities \(Figure [1](https://arxiv.org/html/2606.15412#S1.F1)\)\. While modern NER approaches achieve strong performance with robust tools\[[16](https://arxiv.org/html/2606.15412#bib.bib19)\], RE remains challenging due to its sensitivity to complex language, long\-range dependencies, and domain variability\.

![Refer to caption](https://arxiv.org/html/2606.15412v1/figures/BioREDirect_scheme.png)

Figure 1: Conversion of unstructured text into structured biomedical knowledge\. A text passage is first annotated with named entities using NER \(A\)\. Next, RE is applied to identify semantic relationships between entities \(B\)\. The output is organized into a KG, where nodes represent entities and edges represent their relationships \(C\)\. The example is adapted from the BioREDirect dataset\[[11](https://arxiv.org/html/2606.15412#bib.bib39)\], which we use in our analyses\.

Biomedical RE \(BioRE\) methods typically rely on supervised learning with expert\-annotated data, which is costly and labor\-intensive to obtain\. Moreover, these methods often struggle to generalize to rare or novel relation types and to new domains\[[6](https://arxiv.org/html/2606.15412#bib.bib5)\]\. This has motivated interest in prompt\-based methods with large language models \(LLMs\) that are better suited to low\-resource settings and cross\-domain generalization\. Trained on vast text corpora, LLMs encode extensive linguistic and factual knowledge in their parameters, which can be leveraged during inference\. The combination of learned knowledge and prompt\-conditioned generation enables LLMs to perform a wide range of tasks without task\-specific training\[[3](https://arxiv.org/html/2606.15412#bib.bib56)\]\. However, reported few\-shot performance of LLMs for BioRE varies considerably, ranging from substantial underperformance relative to fully supervised methods\[[10](https://arxiv.org/html/2606.15412#bib.bib74),[12](https://arxiv.org/html/2606.15412#bib.bib40),[21](https://arxiv.org/html/2606.15412#bib.bib58)\] to competitive results\[[1](https://arxiv.org/html/2606.15412#bib.bib75),[8](https://arxiv.org/html/2606.15412#bib.bib68),[23](https://arxiv.org/html/2606.15412#bib.bib43)\], which motivates further research\.

In this work, we investigate how prompting strategies affect BioRE performance across two task formulations: *pairwise classification*, in which the model predicts the relation between a single annotated entity pair at a time, and *joint generation*, in which the model predicts multiple relations among all annotated entities in a single call\. We compare these two paradigms in terms of extraction performance and computational efficiency; to our knowledge, such a systematic comparison has not yet been conducted for BioRE\. We perform our analysis using the recent open\-weight Gemma\-4 and Qwen\-3\.5 model families\.

## 2 Related Work

RE is the task of identifying semantic relationships between entities in text\. It encompasses several subtasks, including identifying relevant entity pairs, classifying their relation types, and determining relation directionality\. In the biomedical domain, RE is essential for uncovering interactions among genes, proteins, diseases, and chemical compounds, supporting applications such as drug discovery, pathway analysis, and disease modeling\.

#### 2\.0\.1 Rule\-Based Methods\.

Early BioRE approaches relied on co\-occurrence heuristics or manually designed rules based on lexical and syntactic patterns\[[2](https://arxiv.org/html/2606.15412#bib.bib76),[7](https://arxiv.org/html/2606.15412#bib.bib77)\]\. While some achieve high precision, these methods are typically labor\-intensive and suffer from low recall and limited generalizability\. In particular, they struggle to capture complex linguistic phenomena such as negation and long\-range dependencies and often fail to transfer to new datasets or domains\[[5](https://arxiv.org/html/2606.15412#bib.bib11)\]\.

#### 2\.0\.2 Supervised Learning\.

Modern BioRE approaches are predominantly based on supervised learning, where models are trained to predict relations from annotated examples\. Early methods relied on traditional machine learning techniques such as support vector machines\[[15](https://arxiv.org/html/2606.15412#bib.bib70)\] and graph convolutional networks\[[22](https://arxiv.org/html/2606.15412#bib.bib72)\], while more recent approaches are based on pre\-trained language models \(PLMs\)\[[18](https://arxiv.org/html/2606.15412#bib.bib73)\]\.

PLM\-based methods typically follow a two\-stage paradigm\. First, models are pre\-trained on large unlabeled corpora using a language modeling objective\. Second, they are fine\-tuned on task\-specific datasets using supervised objectives tailored to the extraction task\. The most widely adopted architecture is Bidirectional Encoder Representations from Transformers \(BERT\)\[[4](https://arxiv.org/html/2606.15412#bib.bib78)\], which has become the dominant backbone for modern BioRE systems\. For example, Lai et al\.\[[11](https://arxiv.org/html/2606.15412#bib.bib39)\] combined PubMedBERT with soft\-prompt tuning and multi\-task learning to achieve state\-of\-the\-art results on the BioREDirect and BC5CDR datasets\. To overcome BERT’s 512\-token input limit, the method employs chunking, allowing the model to leverage information from different sections of a document while simultaneously predicting relation type, directionality, and novelty\.

Additionally, LLMs have been adapted to BioRE through fine\-tuning\. For example, Peng et al\.\[[14](https://arxiv.org/html/2606.15412#bib.bib50)\] investigated clinical relation extraction by comparing full model fine\-tuning, soft\-prompt tuning, and their combination\. They showed consistent improvements across all settings, with the combined approach achieving the best performance\. They further observed that the performance gap between fine\-tuned and untuned models decreases with model size, suggesting that larger LLMs may perform BioRE effectively without task\-specific training\.

Despite strong performance, PLM\- and LLM\-based supervised approaches have several limitations\. These models are computationally expensive to train and generalize poorly when trained on small or highly specialized datasets\[[24](https://arxiv.org/html/2606.15412#bib.bib41)\]\. Therefore, their effectiveness is heavily dependent on large, high\-quality annotated datasets, which are costly and time\-consuming to construct in the biomedical domain due to the need for expert annotation\[[17](https://arxiv.org/html/2606.15412#bib.bib44)\]\.

#### 2\.0\.3 Prompt\-Based Learning\.

To address the limitations of supervised systems, prompt\-based learning with LLMs, particularly zero\- and few\-shot approaches, has attracted increasing attention\. As model architectures and training corpora continue to scale, LLMs demonstrate a growing ability to adapt to downstream tasks without task\-specific fine\-tuning\[[23](https://arxiv.org/html/2606.15412#bib.bib43)\]\. Prompts typically describe the BioRE task, target relation types, and extraction criteria, and include a small number of demonstrations to guide the model towards the desired behavior\. Recent work has also explored advanced prompting strategies such as chain\-of\-thought, question\-answering, and self\-verification to further improve performance\[[19](https://arxiv.org/html/2606.15412#bib.bib10)\]\.

Existing work has primarily explored two main BioRE task formulations: pairwise classification, which predicts relations for individual entity pairs, and joint generation, which extracts multiple relations in a single model call\. Using pairwise classification, Zhao et al\.\[[23](https://arxiv.org/html/2606.15412#bib.bib43)\] demonstrated that carefully designed prompts can achieve performance competitive with supervised methods\. Their approach combines task instructions describing the target relations and extraction criteria with both positive and negative examples, yielding substantially higher recall than the corresponding supervised baselines, albeit often at the cost of lower precision\.

In contrast, Liu et al\.\[[12](https://arxiv.org/html/2606.15412#bib.bib40)\] evaluated several open\-source LLMs using joint generation in both few\-shot and fine\-tuned settings\. They found that larger models substantially outperform smaller ones and that parameter\-efficient fine\-tuning methods such as LoRA can partially close the performance gap\. Nevertheless, even the strongest evaluated LLMs remained well below supervised BERT\-based approaches\. Notably, their prompts did not explicitly describe the target relation types, which may have limited the model’s ability to distinguish them effectively\.

Overall, prompt\-based methods offer a flexible alternative to supervised systems, particularly in low\-resource and cross\-domain settings, as they can generalize from only a small number of examples\. However, their effectiveness depends strongly on prompt design, model architecture, and task formulation\. Because LLM inference is substantially more costly than that of conventional supervised models, both extraction quality and computational efficiency must be considered when evaluating BioRE systems\. This motivates a systematic evaluation of prompt\-based methods and LLM architectures for BioRE\.

## 3 Evaluation Setup

We evaluated prompt\-based learning for BioRE in a few\-shot setting using recent LLMs, comparing pairwise classification and joint generation task formulations\. Our work considers both extraction quality and computational efficiency and relates the results to existing supervised and prompt\-based baselines\.

Unlike most prior work that focuses on sentence\-level extraction\[[24](https://arxiv.org/html/2606.15412#bib.bib41)\], we investigate document\-level BioRE\. This setting is more challenging due to the larger number of entity pairs and long\-range dependencies, but better reflects real\-world texts, where relations often span sentence boundaries\[[20](https://arxiv.org/html/2606.15412#bib.bib53)\]\.

### 3\.1 Dataset

Table 1: BioREDirect dataset entity and relation type labels\. Test split counts are reported\. Species and CellLine entity types are omitted because their relations are excluded from the annotation schema\.

| Type | Count | Description |
| --- | --- | --- |
| Entity Types | | |
| GeneOrGeneProduct | 5,724 | Genes/proteins \(e\.g\., TP53, EGFR\) |
| DiseaseOrPhenotypicFeature | 3,635 | Diseases/phenotypes \(e\.g\., breast cancer\) |
| ChemicalEntity | 2,582 | Drugs/chemicals \(e\.g\., aspirin, glucose\) |
| SequenceVariant | 1,774 | Genetic variants \(e\.g\., BRAF V600E\) |
| Relation Types | | |
| Association | 2,759 | Relation without clear polarity or mechanism |
| Positive Correlation | 1,751 | Promotes, induces, or increases |
| Negative Correlation | 1,192 | Inhibits, treats, or decreases |
| Cotreatment | 172 | Combination drug therapy |
| Bind | 136 | Direct molecular binding |
| Comparison | 13 | Explicit comparison only |
| Conversion | 13 | One chemical converts to another |
| Drug Interaction | 0 | Pharmacological drug interaction |

Experiments were conducted on the BioREDirect dataset\[[11](https://arxiv.org/html/2606.15412#bib.bib39)\], which consists of PubMed abstracts hand\-annotated with six entity types and eight relation types, as described in Table [1](https://arxiv.org/html/2606.15412#S3.T1)\. Relation annotations are not provided for Species and CellLine entities or for Gene–Variant and Disease–Disease pairs\. The relation type distribution is highly imbalanced, with *Association*, *Positive Correlation*, and *Negative Correlation* accounting for more than 95% of all relation instances\. Our evaluations were performed on the test split, which contains 400 abstracts and 6,036 annotated relation instances\.

### 3\.2 Models

We employed models from two recent open\-source LLM families, Gemma\-4 \([https://huggingface\.co/collections/google/gemma\-4](https://huggingface.co/collections/google/gemma-4)\) and Qwen\-3\.5 \([https://huggingface\.co/collections/Qwen/qwen35](https://huggingface.co/collections/Qwen/qwen35)\), which have demonstrated strong performance across a range of benchmarks\. We focused on medium\-sized variants \(25–35B parameters\) that can be reliably deployed on a single NVIDIA DGX Spark system with 128 GB of unified memory\.

The evaluated models include both dense and mixture\-of\-experts \(MoE\) architectures\. Dense models utilize all parameters for every input token, leading to consistent but computationally expensive inference\. In contrast, MoE models activate only a subset of parameters \(experts\) per token, reducing computational costs while maintaining high total parameter counts\.

All models were evaluated in full precision \(BF16\) using deterministic decoding with temperature, presence penalty, and frequency penalty set to 0, top\-p and top\-k set to 1, and a fixed random seed\. Experiments were conducted both with and without *reasoning* enabled, which allows the model to generate intermediate reasoning steps before producing a final response, potentially improving performance on complex tasks at the cost of increased inference latency\.

### 3\.3 Tasks

![Refer to caption](https://arxiv.org/html/2606.15412v1/figures/RE_task.png)

Figure 2: Comparison of BioRE task formulations for an input text with $N=5$ entities\. The color highlights represent the annotated entities\. In the pairwise classification setting, each entity pair is evaluated independently, resulting in 10 model calls\. In the joint generation setting, each call contains a subset of $k$ entities, constructed such that every entity pair co\-occurs in at least one call\. With $k=3$, all pairs are covered in 4 calls, while $k=N$ reduces the task to a single call\.

We performed BioRE using prompt\-based learning\. Each prompt contained task instructions, relation type descriptions, and a small set of few\-shot examples\. Relation descriptions were derived from the original annotation guidelines used during dataset construction\. The model was instructed to extract only relations explicitly stated in the text; co\-mentions, indirect reasoning, and ambiguous relation directionality were treated as negative instances\. Few\-shot examples were sampled from the train split and cover all relation types\. The input text, annotated with the entities of interest, was appended to the end of the prompt\.
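The prompt layout described above can be sketched as follows\. This is a minimal illustration, not the prompt used in the paper: the function name `build_prompt`, the section labels \("Relation types:", "Example", "Text:", "Answer:"\), and the example strings are all hypothetical; only the ordering \(instructions, relation descriptions, few\-shot examples, then the annotated input text at the end\) follows the description\.

```python
from typing import Dict, List, Tuple


def build_prompt(
    instructions: str,
    relation_descriptions: Dict[str, str],
    examples: List[Tuple[str, str]],
    annotated_text: str,
) -> str:
    """Assemble a few-shot prompt (hypothetical layout, not the paper's exact wording)."""
    parts = [instructions.strip(), "", "Relation types:"]
    # One line per relation type, e.g. taken from annotation guidelines.
    for name, description in relation_descriptions.items():
        parts.append(f"- {name}: {description}")
    parts.append("")
    # Few-shot demonstrations: (annotated example text, expected answer).
    for i, (example_text, example_answer) in enumerate(examples, start=1):
        parts.append(f"Example {i}:")
        parts.append(f"Text: {example_text}")
        parts.append(f"Answer: {example_answer}")
        parts.append("")
    # The annotated input text is appended at the end of the prompt.
    parts.append(f"Text: {annotated_text}")
    parts.append("Answer:")
    return "\n".join(parts)


if __name__ == "__main__":
    print(
        build_prompt(
            "Extract only relations explicitly stated in the text.",
            {"Bind": "Direct molecular binding"},
            [("A binds B.", "(A, B, Bind)")],
            "X inhibits Y.",
        )
    )
```

The same builder could serve both task formulations: for pairwise classification the annotated text would mark a single entity pair, while for joint generation it would mark all entities included in the call\.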

We evaluated two task formulations \(Figure [2](https://arxiv.org/html/2606.15412#S3.F2)\):

- Pairwise Classification: The model receives the input text annotated with a single entity pair and predicts the relation type between them, or *None* if no relation is supported\. This procedure is repeated independently for each candidate entity pair\.
- Joint Generation: The model receives the input text annotated with multiple entities and generates all supported $(\text{head}, \text{tail}, \text{relation})$ triples in a single call\. If no valid relations are present, the model outputs *None*\.

To study the trade\-off between extraction quality and computational efficiency, we additionally varied the context scope in the joint generation setting using a parameter $k$, which defines the number of entities included in a single model call\. At $k=N$, all entities in an abstract are processed together in a single call\. For smaller values of $k$, entities are divided into overlapping subsets constructed so that every entity pair appears in at least one subset\. We used a greedy approximation algorithm to construct a near\-minimal set of subsets covering all entity pairs\. Predictions from overlapping subsets were aggregated by majority vote for each entity pair, with ties discarded as ambiguous\.
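The subset construction and vote aggregation can be sketched as below\. The text does not specify which greedy algorithm was used, so this is one plausible set\-cover\-style variant under stated assumptions: at each step it picks the subset of size k that covers the most still\-uncovered pairs, scanning all candidate subsets exhaustively \(acceptable for the handful of entities in an abstract, but exponential in general\)\. The function names are hypothetical, and the handling of *None* votes \(passed in explicitly by the caller and dropped if they win\) is also an assumption\.

```python
from collections import Counter
from itertools import combinations


def greedy_pair_cover(entities, k):
    """Greedily pick k-entity subsets until every entity pair is covered."""
    entities = list(entities)
    if k >= len(entities):
        return [tuple(entities)]  # k = N: a single call with all entities
    if k < 2:
        raise ValueError("k must be at least 2 to cover entity pairs")
    uncovered = {frozenset(p) for p in combinations(entities, 2)}
    subsets = []
    while uncovered:
        # Choose the subset covering the most still-uncovered pairs.
        best = max(
            combinations(entities, k),
            key=lambda s: sum(frozenset(p) in uncovered for p in combinations(s, 2)),
        )
        subsets.append(best)
        uncovered -= {frozenset(p) for p in combinations(best, 2)}
    return subsets


def aggregate_votes(predictions, none_label="None"):
    """Majority vote per entity pair over (pair, label) items; ties are discarded.

    Assumption: a call that predicts no relation for a pair contributes an
    explicit none_label vote, and pairs won by none_label are omitted.
    """
    votes = {}
    for pair, label in predictions:
        votes.setdefault(pair, Counter())[label] += 1
    final = {}
    for pair, counter in votes.items():
        ranked = counter.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue  # tie between the top labels: ambiguous, discard
        if ranked[0][0] != none_label:
            final[pair] = ranked[0][0]
    return final


if __name__ == "__main__":
    for subset in greedy_pair_cover(["e1", "e2", "e3", "e4", "e5"], k=3):
        print(subset)
```

For five entities and k = 3, this sketch produces four subsets, which agrees with the call count shown in Figure 2\.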

### 3\.4 Metrics

Extraction performance was evaluated using precision, recall, and F1 scores based on exact matching against gold annotations\. Metrics were computed for both entity pair \(EP\) and relation type \(RT\) extraction\. Due to substantial class imbalance, we report both micro\- and macro\-F1 scores\.
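To make the difference between the two averages concrete, the sketch below computes micro\- and macro\-F1 from exact matches between gold and predicted relation instances\. It is an illustration only: the tuple layout `(doc_id, entity_a, entity_b, relation_type)` and the function names are hypothetical, and the choice to average macro\-F1 over relation types that occur in either the gold or the predicted set is an assumption, since the text does not state how types with no test instances are handled\.

```python
from collections import Counter


def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def micro_macro_f1(gold, pred):
    """gold, pred: sets of (doc_id, entity_a, entity_b, relation_type) tuples."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for item in pred:
        # Exact match against gold; the last field is the relation type.
        (tp if item in gold else fp)[item[-1]] += 1
    for item in gold - pred:
        fn[item[-1]] += 1
    types = set(tp) | set(fp) | set(fn)
    # Micro: pool counts over all types, so frequent types dominate.
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))[2]
    # Macro: average per-type F1, so every type counts equally.
    macro = sum(prf(tp[t], fp[t], fn[t])[2] for t in types) / len(types) if types else 0.0
    return micro, macro


if __name__ == "__main__":
    gold = {
        (1, "a", "b", "Association"),
        (1, "a", "c", "Association"),
        (1, "b", "c", "Association"),
        (1, "c", "d", "Bind"),
    }
    pred = {(1, "a", "b", "Association"), (1, "c", "d", "Bind")}
    print(micro_macro_f1(gold, pred))
```

In this toy case the frequent type is partly missed while the rare type is fully recovered, so macro\-F1 \(0\.75\) exceeds micro\-F1 \(about 0\.67\), the same direction of effect the two metrics are reported for\.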

In addition, we evaluated computational efficiency using weighted token cost \(Cost\)\. Let $T_{in}$ and $T_{out}$ denote the number of input and output tokens required per extraction run, and let $c_i$ and $c_o$ denote their relative processing costs\. Total cost is computed as

$$\textit{Cost} = c_i T_{in} + c_o T_{out} \tag{1}$$

The relative costs $c_i$ and $c_o$ were estimated empirically from model throughput benchmarks\. Input and output throughput were measured separately as the average number of tokens processed per second during prompt prefill and token decoding, respectively, using repeated sequential inference requests with synthetic prompts that contain unique prefixes to prevent prompt\-cache reuse\. The corresponding costs were defined as the inverse throughput\. Consequently, slower operations receive a higher cost weight\.
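A small worked sketch of Equation \(1\) follows\. The throughput and token figures are made\-up illustrative numbers, not measurements from the paper, and the function names are hypothetical; the only parts taken from the text are the inverse\-throughput definition of the cost coefficients and the weighted sum itself\.

```python
def relative_costs(input_tokens_per_s, output_tokens_per_s):
    """Cost coefficients defined as inverse throughput (seconds per token)."""
    return 1.0 / input_tokens_per_s, 1.0 / output_tokens_per_s


def weighted_token_cost(t_in, t_out, c_i, c_o):
    """Equation (1): Cost = c_i * T_in + c_o * T_out."""
    return c_i * t_in + c_o * t_out


if __name__ == "__main__":
    # Illustrative numbers only: prefill at 2000 tokens/s, decoding at 50 tokens/s.
    c_i, c_o = relative_costs(2000.0, 50.0)
    # A run with 1000 input tokens and 100 output tokens.
    cost = weighted_token_cost(1000, 100, c_i, c_o)
    print(c_i, c_o, cost)
```

With these illustrative values, the 100 output tokens contribute four times as much to the cost as the 1000 input tokens \(2\.0 versus 0\.5\), which shows why settings that generate many output tokens weigh heavily in this measure\.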

## 4 Results

Table 2: General BioRE performance\. F1 scores are reported for EP and EP\+RT with reasoning disabled/enabled\. For the EP\+RT condition, both Micro\-F1 and Macro\-F1 are reported due to relation type imbalance\. Best results are shown in bold\.

Table [2](https://arxiv.org/html/2606.15412#S4.T2) summarizes extraction performance across model architectures and task formulations\. The highest EP\+RT micro\-F1 score of 0\.44 is achieved by Qwen3\.5\-27B in the classification setting\. Our approaches substantially outperform the few\-shot results reported by Liu et al\.\[[12](https://arxiv.org/html/2606.15412#bib.bib40)\] but remain below the supervised baselines of Lai et al\.\[[11](https://arxiv.org/html/2606.15412#bib.bib39)\]\. This is largely attributable to performance on the *Association* class\. Considering macro\-F1 scores, the performance gap narrows considerably: the best result is achieved by Gemma\-4 models in the generation setting, with an EP\+RT macro\-F1 score of 0\.45, compared with 0\.38 for the BERT baseline\.

![Refer to caption](https://arxiv.org/html/2606.15412v1/figures/BioREDirect_mistake.png)

Figure 3: Example of BioREDirect annotation ambiguity: the abstract contains three entities – SNHG15 \(lncRNA\), miR\-18a \(miRNA\), and CXCL13 \(protein\)\. All pairwise relations are labeled as *Association*\. However, the text provides evidence for more specific relations\. The relation between SNHG15 and miR\-18a is explicitly described as *Bind* \(“SNHG15 was found to bind to miR\-18a”\), while the statement “Silencing of SNHG15 led to CXCL13 upregulation” suggests a negative regulatory relation between SNHG15 and CXCL13\. This illustrates how the *Association* labels in the gold dataset can mask more specific relationships described in the text\.

The *Association* class is inherently ambiguous as it serves as a fallback label during annotation for cases where a more specific relation type cannot be determined confidently\. Such ambiguity may pose a greater challenge for instruction\-driven LLM approaches than for supervised models trained directly on the annotated data\. Moreover, our qualitative analysis suggests that some instances labeled as *Association* could reasonably be assigned more specific relation types\. Figure [3](https://arxiv.org/html/2606.15412#S4.F3) illustrates one such case where our model predicts more specific relation types and is penalized when evaluated against the gold annotation\. While this does not necessarily indicate an annotation error, it highlights the subjective nature of relation categorization in borderline cases\. As a result, part of the observed performance difference may reflect limitations of the annotation scheme rather than extraction capability, suggesting that the true performance gap may be smaller than the BioREDirect scores indicate\.

Table 3: Relation\-level BioRE performance\. F1 scores \(with precision and recall in brackets\) are reported for the EP\+RT task\. The results are shown for the Gemma\-4\-31B\-it model with reasoning enabled\.

Table [3](https://arxiv.org/html/2606.15412#S4.T3) shows extraction performance per relation type for Gemma\-4\-31B\-it\. The model outperforms the supervised baseline on several low\-frequency relation types, including *Bind*, *Comparison*, and *Cotreatment*\. This suggests that LLMs are less affected by class imbalance when relation definitions are sufficiently specific and semantically distinct, representing a key advantage of LLM\-based approaches for BioRE\. However, these results should be interpreted with caution, as the number of evaluation examples for these relation types is low \(Table [1](https://arxiv.org/html/2606.15412#S3.T1)\) and performance estimates may therefore be subject to substantial variance\.

#### 4\.0\.1 Task Formulations\.

The two task formulations exhibit a clear precision–recall trade\-off\. As shown in Table [3](https://arxiv.org/html/2606.15412#S4.T3), pairwise classification consistently achieves higher recall, whereas joint generation yields higher precision at the expense of recall\. This pattern is observed across model architectures and is consistent with prior work by Zhao et al\.\[[23](https://arxiv.org/html/2606.15412#bib.bib43)\]\. It likely reflects the autoregressive nature of generation, where uncertain relations are often omitted rather than predicted\. In contrast, classification evaluates predefined entity pairs, resulting in broader relation coverage and more consistent extraction behavior\.

Introducing $k$\-constrained generation provides a mechanism for navigating this precision–recall trade\-off\. By limiting the number of entities, and therefore entity pairs, considered in a single generation step \($k<N$\), the model achieves precision levels that remain higher than for pairwise classification while recovering some of the recall lost in unconstrained generation\. As a result, $k$ acts as a controllable parameter that allows practitioners to balance extraction conservativeness against relation coverage according to task requirements\.

Table 4: Cost analysis\. Relative input and output cost coefficients \($c_i$ and $c_o$\) are estimated from the inverse throughput of each model\. For easier comparison, all costs are normalized with respect to Gemma\-4\-31B\-it \(joint generation with $k=N$ and reasoning enabled\)\. Cost values are reported with reasoning disabled/enabled\.

The relative costs presented in Table [4](https://arxiv.org/html/2606.15412#S4.T4) reveal a substantial efficiency advantage for joint generation, reducing computational costs by up to 25× compared to pairwise classification\. This difference arises because classification requires a separate model call for each entity pair, whereas generation can extract multiple relations at once\. The $k$\-constrained generation provides an intermediate cost profile\. By extracting relations for multiple entity pairs per call, it remains substantially more efficient than pairwise classification, while the need for multiple calls to cover all pairs results in higher costs than full joint generation\.

#### 4\.0\.2 Model architectures and reasoning\.

No single model family consistently outperforms the others across all settings\. In the pairwise classification setting, Qwen\-3\.5 models achieve the strongest results in both micro\- and macro\-F1, whereas Gemma\-4 models perform best in the joint generation setting\.

Enabling reasoning generally improves extraction performance, likely by facilitating more effective processing of complex relations\. The gains are particularly pronounced for MoE models, which appear to benefit more from reasoning than dense architectures\. However, these improvements come at a substantial computational cost\. Reasoning significantly increases the number of generated output tokens, which are considerably more expensive than input tokens \(see the $c$ values in Table [4](https://arxiv.org/html/2606.15412#S4.T4)\), thereby increasing inference costs considerably\.

Computational efficiency also varies across architectures\. MoE models are consistently more cost\-efficient than dense models despite producing a similar number of output tokens\. By activating only a subset of parameters for each token, MoE architectures reduce inference costs substantially\. Nevertheless, dense models retain a slight performance advantage, particularly when reasoning is disabled\.

#### 4\.0\.3 Limitations and future work\.

Several avenues may further improve extraction performance\. Larger and biomedical domain\-specific LLMs are promising candidates, while uncertainty\-aware prediction strategies that leverage token probabilities could help filter low\-confidence predictions in the classification setting\. Additional gains may come from advanced prompting techniques, such as chain\-of\-thought reasoning, and task\-specific preprocessing methods, including entity masking, synonym replacement, and irrelevant entity filtering\. From an efficiency perspective, inference costs could be reduced through model quantization, improving deployment practicality without requiring architectural changes\.

A limitation of this study is that complete reproducibility cannot be guaranteed\. Although all experiments used deterministic decoding with a temperature of zero, small variations in model outputs were observed across runs\. This behavior is a known characteristic of LLM inference and likely arises from low\-level computational differences during execution\. Consequently, evaluation scores may exhibit minor fluctuations\. Repeated evaluations and confidence intervals should therefore be used to better characterize performance variability\.

Finally, our evaluation focused exclusively on EP and RT extraction and did not assess relation directionality or novelty, which are also annotated in the dataset\. Incorporating these tasks would provide a more comprehensive evaluation of BioRE performance\. Future studies could additionally compare within\- and cross\-sentence relations to further disentangle the strengths of few\-shot and supervised approaches\. Given their ability to process longer contexts, LLMs may outperform BERT\-based models on cross\-sentence relation extraction\.

## 5 Conclusion

In this work, we evaluated modern medium\-sized LLMs for few\-shot biomedical relation extraction\. Our results demonstrate that they provide a viable alternative to supervised systems, particularly in settings with limited annotated data and well\-defined relation semantics\. Although the proposed approaches remain below the supervised baseline in terms of micro\-F1, this gap appears to be strongly influenced by dataset imbalance and ambiguity in the *Association* class\. When considering macro\-F1, which better reflects performance across relation types, LLM\-based approaches outperform the supervised baseline, particularly on low\-frequency relations\.

Across task formulations, classification and generation achieved broadly comparable F1 scores, with performance differences primarily reflecting a precision–recall trade\-off rather than a clear superiority of either approach\. Pairwise classification consistently achieved higher recall and is therefore better suited to applications requiring maximal relation coverage\. In contrast, joint generation yielded higher precision while incurring substantially lower computational costs, making it attractive for scenarios where extraction quality and efficiency are prioritized\. The proposed $k$\-constrained generation setting provides an effective intermediate alternative, enabling practitioners to balance precision, recall, and inference cost according to downstream task requirements\.

## Data and Code Availability

## Funding

This work was supported by core funding P2\-0209 \(Artificial Intelligence and Intelligent Systems\) and project grants GC\-0001 \(Artificial Intelligence for Science\) and L2\-60154 \(Explainable Foundation Models for Human Gene Expression\), all funded by ARIS\.

