The Word and the Way: Strategies for Domain-Specific BERT Pre-Training in German Medical NLP
Summary
This paper introduces ChristBERT, a family of domain-specific RoBERTa-based language models for German clinical NLP, and evaluates three domain adaptation strategies (continued pre-training, pre-training from scratch, and vocabulary adaptation) on medical named entity recognition and text classification tasks, achieving state-of-the-art results.
View Cached Full Text
Cached at: 06/03/26, 09:38 AM
# The Word and the Way: Strategies for Domain-Specific BERT Pre-Training in German Medical NLP
Source: [https://arxiv.org/html/2606.03250](https://arxiv.org/html/2606.03250)
\\equalcont
These authors contributed equally to this work\.
\[1,3\]\\fnmRaphael\\surSchmitt\\equalcontThese authors contributed equally to this work\.
1\]\\orgdivSchool of Computation, Information and Technology,\\orgnameTechnical University of Munich,\\orgaddress\\cityMunich,\\countryGermany
2\]\\orgdivChair of IT Infrastructure for Translational Medical Research,\\orgnameFaculty of Applied Computer Science, University of Augsburg,\\orgaddress\\cityAugsburg,\\countryGermany
3\]\\orgdivInstitute of General Practice,\\orgnameFaculty of Medicine and Medical Center, University of Freiburg,\\orgaddress\\countryGermany
###### Abstract
Background:Digital healthcare generates vast amounts of clinical texts that hold potential for AI\-assisted applications\. However, existing German biomedical language models either rely on older architectures or are trained on limited data, which may hinder their performance in real\-world settings\.
Methods:To explore the impact of domain adaptation strategies in German clinical NLP, we developed a family of domain\-specific RoBERTa\-based language models, collectively referred to asChristBERT\(Clinical\- andHealthcare\-RelatedIssues andSubjectsTuned BERT\)\. To address the lack of large\-scale German clinical corpora, we curated a 13\.5 GB dataset consisting of scientific publications, clinical texts, and health\-related web content\. Additionally, we employed data augmentation via translation of English clinical corpora\. Three domain adaptation strategies were explored: continued pre\-training, pre\-training from scratch, and pre\-training with domain\-specific vocabulary adaptation\.
Results:The resulting models were evaluated on three medical named entity recognition and two text classification tasks\. Our models consistently outperformed four existing general\-purpose and medical German models on four out of five tasks\. The results demonstrate that the choice of domain adaptation strategy significantly influences downstream task performance\. Based on the empirical results, pre\-training from scratch is effective for highly specialized clinical texts, whereas continued pre\-training is suited for more commonly written medical texts\.
Conclusions:ChristBERT establishes a new state\-of\-the\-art for German clinical language modeling\. Our findings indicate that the optimal domain adaptation strategy is task\-dependent and remains crucial, as adapted models consistently outperformed general\-purpose language models in our experiments\. To support further research and application in German medical NLP, all developed models are publicly released\.
###### keywords:
Natural Language Processing, Medical Informatics, Machine Learning, Electronic Health Records, Named Entity Recognition, Text Classification, Language Models, Biomedical Text Mining, Germany
## 1Introduction
The digitization of health services and clinical processes has resulted in the healthcare industry generating an ever\-increasing amount of textual data, encompassing electronic health records, clinical notes, medical reports, and discharge letters among many others\. While structured data is frequently used for health economics and registries, the aforementioned unstructured clinical narratives are preferred by physicians to record patients’ clinical information due to their flexibility and efficiency, and make up to 40% of the data generated in current hospital systems\[wang2018clinical,dalianis2009stockholm\]\. The substantial potential of narrative text data to support clinical applications was recognized early\[sager1994natural,borst1991textinfo,friedman1995architectural\]and more recently, research efforts have been directed towards developing medical applications assisted by artificial intelligence \(AI\)\. Prominent applications include decision support systems that assist healthcare professionals in their tasks, alleviating their workload and providing better treatments for patients\[zhou2022natural\]\.
However, the unstructured nature of textual data and the intricacies of the biomedical field pose significant challenges for leveraging its potential\. In such a context, natural language processing \(NLP\) methods could structure that information to support downstream clinical applications\. Recent advancements in NLP brought about by large\-scale pre\-trained language models based on the Transformer\[vaswani2017attention\]architecture, introduced new ways for extracting and analyzing the knowledge contained within the clinical texts\. Through extensive self\-supervised training on vast corpora of text, a model can acquire valuable representations of a language, producing highly effective language models\.
The success of Transformer\-based models like BERT \(Bidirectional Encoder Representations from Transformers\)\[devlin2019bert\]and its improved version RoBERTa\[liu2019roberta\], can be largely attributed to the use of transfer learning expressed in the pretrain\-finetune paradigm\. In this paradigm, a model initially goes through a resource\-demanding training process, i\.e\.pre\-training, using general\-purpose textual data to learn the language structure\. This pre\-training phase is self\-supervised, eliminating the need for labeled data by utilizing objectives like masked language modeling\[devlin2019bert\]\. The model is thenfine\-tunedfor various tasks through a second, more cost\-effective training round using a smaller, labeled, and task\-specific dataset that adjusts the model’s weights to fit the specific task and application domain at hand\.
Direct application of general\-purpose language models to a specific domain might limit performance due to significant distributional differences between general and target domains\. Even within the same language, domain\-specific language can vary significantly from everyday language, leading to the need of domain\-specific models\[arefeva2022tourbert\]\. This particularly holds for the medical domain, where the language is highly specialized and complex\. Medical language features numerous acronyms that are crucial for saving time and space, yet they can be ambiguous and require context to be understood\. Spelling errors are common, and there is an abundance of abbreviations\[tayefi2021challenges\]\. Moreover, the medical vocabulary is highly specialized, as it is not typically used in everyday language, making it unfamiliar to those outside the medical profession\. When the target domain, such as medicine, differs considerably from the pre\-training data, models can be improved by an additional phase of domain\-adaptive training using large, domain\-specific corpora with the same pre\-training objectives\.
Such specifically designed medical language models hold significant promise for enhancing the efficiency and precision of medical document handling\[beltagy2019scibert,huang2019clinicalbert,peng2019transfer,lee2020biobert\]\. For the German medical domain, the effectiveness of such models has been demonstrated by BioGottBERT\[lentzen2022critical\]and medBERT\.de\[bressem2024medbert\]\. However, the availability of open\-source biomedical corpora large enough for domain adaptation is limited, primarily due to the sensitive nature of health\-related data, and is largely confined to the English language, given its established status as the language of science\. Despite these obstacles, advancing medical language models remains crucial, as they have the potential to manage the large volumes of text produced in hospitals every day\. In this work, we aimed to develop a new comprehensive German clinical language model based on the RoBERTa architecture by building upon the foundation laid by GeistBERT\[scheibleschmitt2025geistbertbreathinglifegerman\], hereinafter referred to asChristBERT:Clinical\- andHealthcare\-RelatedIssues andSubjectsTuned BERT\. The main emphasis of this work lies in the construction of a large German pre\-training corpus, encompassing a diverse range of biomedical and clinical texts\. These sources provided a broad spectrum of medical language data, fostering the model’s robustness and applicability\. In order to achieve this, we utilized a combination of mostly publicly available German medical textual data and synthetic German domain texts by augmenting the corpus with translated medical texts\[edunov2018understanding\]\. This approach involves translating a monolingual corpus using neural machine translation models\[ng2019facebook,costa2022no\], allowing us to leverage the vast amount of public English medical texts available\. Based on the constructed corpus, we pre\-trained ChristBERT by using Whole Word Masking \(WWM\) and following three different domain\-adaptation strategies: \(1\) continued pre\-training, \(2\) pre\-training from scratch with general\-purpose vocabulary, and \(3\) pre\-training from scratch with additional prior vocabulary adaptation\. In order to investigate the effects of the different domain\-adaptation approaches, we evaluated the performance of the resulting models on two domain\-specific downstream tasks: named entity recognition and classification\. The downstream task performance has been thoroughly evaluated and compared to existing medical and general\-purpose German language models\.
## 2Related Work
Past developments in medical NLP research have seen the creation of mature systems for extracting information from English clinical texts like MetaMap\[aronson2010overview\], cTAKES\[savova2010mayo\], MedLEE\[friedman1995architectural,friedman2000broad\]and CLAMP\[soysal2018clamp\]\. These systems have been used for various tasks such as named entity recognition \(NER\), relation extraction, and information retrieval\. Additionally, open competitions such as Informatics for Integrating Biology and the Bedside \(i2b2\)\[uzuner20112010\], National NLP Clinical Challenges \(n2c2\)\[henry20202018,stubbs2019cohort\], and CLEF eHealth\[crestani2019experimental\]challenge from the Conference and Labs of the Evaluation Forum \(CLEF\) promote data and model sharing, further advancing the medical NLP field\. The systems developed to date encompass rule\-based, machine\-learning\-based, and hybrid models\. While rule\-based methods were essential in early developments, the performance of these systems is limited by their reliance on hand\-crafted rules and lexicons, which are difficult to maintain and generalize across different clinical settings\.
In order to overcome these challenges, current research emphasizes machine\-learning techniques\. In particular, deep\-learning approaches like recurrent neural networks \(RNN\) and convolutional neural networks have been widely used in recent years due to their ability to achieve superior performance with adequate training data\. Unlike traditional machine\-learning methods, deep neural networks typically use methods such as Word2Vec\[mikolov2013distributed\], GloVe\[peters2018dissecting\], or FastText\[joulin2017bag\]to represent words as vectors\. These methods create word embeddings by learning relationships between words from large text corpora, eliminating the need for manual feature engineering\. Nevertheless, these methods represent all possible meanings of a word in a single vector, making them unable to distinguish between different word senses based on the surrounding context\. Vaswani et al\.\[vaswani2017attention\]introduced a new model able to provide contextualized word representation called the Transformer\. Originally designed for neural machine translation, the Transformer addresses two limitations of RNNs: lack of parallelization and handling of long\-range dependencies\. It relies on the self\-attention mechanism, which differentially weighs parts of the input\. Since it operates without recurrence, it is more parallelizable and computationally efficient than RNNs\.
In 2019, Devlin et al\.\[devlin2019bert\]utilized parts of the original Transformer architecture to develop BERT, achieving state\-of\-the\-art results in numerous NLP tasks\. Performance of these large\-scale language models heavily depends on the underlying data used for pre\-training\. A homogeneous text corpus generally leads to a poorer performing model compared to one trained on diverse text corpora of high variance\[martin2020camembert\]\. Initially, much of BERT research was conducted with English texts, followed by efforts in multilingual approaches\[conneau2020unsupervised\]\. While multilingual models were trained on extensive texts from numerous languages, it has been shown that single language models outperform these and are even beneficial in terms of efficiency, pre\-training efforts, and downstream task performance as they demand fewer computational resources and smaller datasets compared to the extensive and diverse data required for multilingual models\[scheible2020gottbert,chan2020german,martin2020camembert\]\. In particular, single\-language models trained with the Open Super\-large Crawled ALMAnaCH coRpus \(OSCAR\)\[suarez2019asynchronous\]demonstrated strong performance, benefiting from the corpus’s size and variability\. Notable examples include CamemBERT\[martin2020camembert\]for French, GottBERT\[scheible2020gottbert\]for German, and BERTje\[de2019bertje\]for Dutch\.
With the increasing use of Transformer\-based models in NLP, there is a growing need in the clinical domain for language models that are not only accurate but also efficient, resource\-conscious, and suitable for local processing\. In settings with limited computational resources and strict data privacy requirements, small yet high\-performing domain\-specific models can provide substantial benefits\. Continued pre\-training on in\-domain data has proven effective for enhancing performance on specialised clinical tasks\. In the biomedical field, the pioneering and most recognized pre\-trained model is BioBERT\[lee2020biobert\], which shares the same architecture as BERT\. Following a domain\-adaptation strategy, BioBERT starts with BERT weights pre\-trained on general texts and then refines these weights using biomedical corpora, surpassing the original model and achieved state\-of\-the\-art performance in numerous biomedical text mining tasks, such as clinical concept recognition, gene\-protein relation extraction, and biomedical question answering\. To gather sufficient open\-source biomedical data, the authors utilized repositories like PubMed\[white2020pubmed\]and PMC\[pmcoa\], obtaining 4\.5 billion words from abstracts and 13\.5 billion words from full\-text articles\. A similar method is employed by SciBERT\[beltagy2019scibert\], which retains the original BERT configuration but substitutes the initial general corpora with 1\.14 million scientific articles randomly chosen from Semantic Scholar\. This dataset consists of 82% broad biomedical domain papers and 18% computer science domain papers\. By training from the ground up on biomedical data, SciBERT can utilize a custom dictionary that better represents the domain\-specific word distribution\. Med\-BERT\[liu2021med\]is the first model fully trained on hospital data, particularly semi\-structured electronic health records, leading to enhanced performance in subsequent prediction models\. These approaches have since been refined, either by updating the model architecture to use BERT variants or by expanding the biomedical corpus with additional sources beyond scientific literature\[huang2019clinicalbert,peng2019transfer\]\.
The extensive range of biomedical and clinical BERT\-based models benefit from the abundance of publicly available biomedical data in English, such as MIMIC\[johnson2023mimic,johnson2023mimicnote\], the largest open\-access dataset of medical records, and extensive repositories of biomedical scientific literature\[white2020pubmed\]\. However, most other languages lack access to these valuable resources, making it challenging to achieve the same level of performance as their English counterparts\. Despite this, researchers from various countries have endeavored to pre\-train non\-English biomedical models, utilizing local and often non\-public biomedical text collections\. They have either trained new models from scratch\[akhtyamova2020named\]or applied biomedical domain adaptation to multilingual\[rubel2020biobertpt\]or monolingual\[copara2020contextualized\]versions of BERT\.
For what concerns the German language, advancements in medical language models are significantly delayed and are often propelled solely by commercial software or localized applications\[starlinger2017improve\]\. Stringent data protection laws impede data sharing, leading clinics to restrict data usage to internal purposes\[hellrich2015sharing\]\. These obstacles hinder the sharing of datasets and models, as well as the organization of open challenges involving German datasets\. In spite of these challenges, there have been notable initiatives in recent years: Datasets such as JSynCC\[lohr2018sharing\]and GGPONC\[borchert2022ggponc\]have been released, containing German biomedical language texts that are not subject to data protection concerns\. Recently, the introduction of the BRONCO150\[kittner2021bronco150\]corpus, which includes de\-identified discharge letters, and GPTNERMED\[frei2023gptnermed\], which leverages large language models, has further expanded the availability of German medical text data\. Additionally, the CLEF eHealth challenge in 2019 provided a dataset of non\-technical summaries of animal studies to be classified according to the International Classification of Diseases and Related Health Problems \(ICD\-10\)\[clef2019nts,clef2019test,world1992icd\]\. A study by\[sanger2019classifying\]utilized the multilingual BERT version \(mBERT\) to classify these summaries, demonstrating that mBERT significantly outperformed a baseline Support Vector Machine model\. To incorporate advances in general German language models,\[lentzen2022critical\]introduced BioGottBERT, a model pre\-trained on open medical German texts from Wikipedia and scientific abstracts, which demonstrated superior performance over its generalized counterpart GottBERT on medical tasks\. Subsequently, the authors of\[bressem2024medbert\]proposed medBERT\.de, in order to address the limited training data size and narrow scope on merely one medical subarea by using 3\.8 million radiology reports, achieving promising results in classification tasks\. While BioGottBERT was trained on a relatively small corpus slightly less than 1 GB of text, medBERT\.de significantly expanded its training corpus to 10 GB, incorporating a wider variety of sources\. However, its BERT architecture has been improved by its optimized version RoBERTa as recently demonstrated for German by the GeistBERT model\[scheibleschmitt2025geistbertbreathinglifegerman\]\. GeistBERT reiterated on GottBERT\[scheible2020gottbert\], by using Whole Word Masking \(WWM\) and continued pre\-training on a significantly more varied and larger general\-domain corpus, thereby establishing state\-of\-the\-art performance on various German NLP benchmarks\.
## 3Methodology
### 3\.1Corpus Creation
Main shortcomings of existing German medical domain models include the limited availability of training data due to the sensitive nature of medical information and strict data privacy regulations\. Furthermore, many existing biomedical Transformer models\[lentzen2022critical,bressem2024medbert\]are pre\-trained or evaluated on proprietary datasets, hindering independent model verification and validation\. Previous studies\[martin2020camembert,dada2023impact,bressem2024medbert\]concluded that training data diversity and quantity are more important than excessive data cleaning, which insignificantly affected downstream performance\. Following these findings, we compiled a 13\.5 GB large and highly varying German biomedical and clinical corpus, focusing on data quantity over quality\. In order to mitigate the aforementioned shortcomings, we primarily relied on public datasets with only two private data sources included, to foster transparency and accessibility of the ChristBERT models\. Tab\.[1](https://arxiv.org/html/2606.03250#S3.T1)summarizes the pre\-training corpus, including descriptive statistics about the number of documents, sentences, words, and size of each incorporated dataset\.
Table 1:Overview of datasets contained in the pre\-training corpus\. The table provides details about each dataset, including the number of documents, sentences, words, and their size in megabytes\. The final corpus includes all listed datasets and amounts to roughly 13\.5 GB of pre\-training data\.#### 3\.1\.1Hpsmedia
Hpsmedia is a German publisher specializing in medical content primarily targeted at healthcare professionals\. Hpsmedia publishes three healthcare journalsPflegewissenschaften \(Nursing Sciences\),Pädagogik der Gesundheitsberufe \(Pedagogy of Health Professions\)andGeschichte der Gesundheitsberufe \(History of Health Professions\), which are available in print and online\. All journals publish articles in German and are peer\-reviewed by experts in the respective fields according to the international reviewing standard BMJ\[smith2006peer\]\. The articles cover a wide range of topics within the healthcare domain including aspects of health and nursing care, pedagogy, didactics, curricula, education in healthcare professions and the history of healthcare professions\. We were kindly provided with the full\-text content of the journals in CSV format by Hpsmedia\. The CSV files were processed using thePandas\[mckinney2010data\]Python library to extract the text content of the articles, which was then included in the pre\-training corpus\. The Hpsmedia dataset consists of 277,357 documents totaling to 3,117 MB of data\.
#### 3\.1\.2Springer Nature
Springer Nature is a prominent global publisher of academic content, known for its extensive collection of high\-quality journals, books, and research materials across various disciplines, including science, technology, and medicine\.
For the extraction of text from Springer Nature publications, the Springer Nature API\[springernature\]was utilized\. The API offers multiple endpoints, e\.g\. metadata, full\-text \(TDM\) as well as a wide range of constraint parameters to filter for desired publications, which are returned in XML format\. This allowed for a systematic filtering for open\-access publications in German\. For our purposes, the open\-access API was first queried for metadata of articles and books related to the subjects ofbiomedicine,public health,pharmacy,dentistryandlife sciences\. The returned XML data was then processed to extract abstracts and Digital Object Identifiers \(DOI\) of each publication, respectively\. Subsequently, the set of DOIs was used to make bulk API calls to the TDM endpoint to subsequently fetch the full\-text content of the publications\. In a final step, the extracted abstracts and full\-text content were both incorporated into the pre\-training corpus accounting for a total of 258,000 documents and 1,984 MB of data\.
#### 3\.1\.3PubMed Central
PubMed Central \(PMC\) is a free digital repository of full\-text scientific literature in the field of biomedicine and life sciences and created as an extension of PubMed\[white2020pubmed\], which holds bibliographic references and abstracts for essentially all publications in the biomedical sciences\. Both repositories are maintained by the National Center for Biotechnology Information \(NCBI\), a part of the United States National Library of Medicine \(NLM\)\. The PMC archive provides access to a collection of over 10 million research articles, reviews, and other scientific publications from a wide range of biomedical and life science journals\. Not all articles in PMC are available for text mining or other reuse as many are under copyright\. ThePMC Open Access Subset\[pmcoa\]contains those articles made truly freely available to the public under Creative Commons or similar licenses that allow more liberal redistribution and repurpose than the majority of licensed and copyrighted articles from subscription access journals deposited in PMC\. PMC stores content in XML format, which is structured according to the Journal Article Tag Suite \(JATS\) standard, a widely used archival markup format for journal articles\. The JATS XML files are made available by NLM for bulk download through their PMC FTP Service\[pmcoa\]\. We downloaded the December 2024 baseline package of the PMC Open Access Subset and transferred the XML files with appropriate metadata such as PubMed ID and publication date to aPostgreSQLdatabase for further processing\. The database design is shown in Fig\.[1](https://arxiv.org/html/2606.03250#S3.F1), which is represented by an entity\-relationship diagram\. The XML files and their corresponding metadata are stored in thexml\_documenttable by leveraging the native support ofPostgreSQLfor XML data types\. For our needs, we extracted the title, abstract, full\-text content and language of the articles from the XML markup by utilizing thePubmed Parser\[achakulvisut2020\]Python library, which supports parsing of the JATS XML format\. The extracted text data was then stored in thedocumenttable, which contains the PubMed ID and language of each document as its primary keys\. The language of each document is represented as a foreign key to thelangtable, which contains the ISO 639\-3 language codes\. Thelangtable is used to ensure data integrity and consistency across the database\.
Figure 1:The diagram shows the database relations for translation management\. The raw XML files with their metadata are stored in thexml\_documenttable, while thedocumenttable contains the extracted text data with both PubMed ID and the language of a document as its primary keys\. Languages are foreign keys to the corresponding entity in thelangrelation according to the ISO 639\-3 standard\.In order to leverage the large amount of English\-language content available in PMC, we translated the English articles to German using theNLLB 200\[costa2022no\]neural machine translation model in its 1\.3 billion distilled variant\. Translation was performed on two Nvidia GeForce RTX 3090 24 GB GPUs, while leveraging theNLLB\-API\[nllbAPI\]library for parallel processing\. The translation posed a significant computational challenge, which was addressed by limiting the publications to be translated to those published in the third and fourth quarters of 2020\. This decision was based on an analysis of article distribution over the past seven years, which is depicted in Fig\.[2](https://arxiv.org/html/2606.03250#S3.F2)\. The analysis revealed a notable peak in publications in 2022, potentially influenced by the COVID\-19 pandemic and the emergence of generative AI\. To ensure that the translated content was not overly biased towards the COVID\-19 pandemic and mitigate the presumably uniform writing style resulting from generative AI, we selected the year 2020 for our translations\. Likewise, given our computational constraints, we chose the third and fourth quarters of 2020, as the quarterly distribution of articles in that year indicated a more feasible volume of publications as seen in Fig\.[3](https://arxiv.org/html/2606.03250#S3.F3)\. Translated documents were saved back to the database in thedocumenttable, but with an updated language key set todefor German\. Further data filtering encompassed the removal of articles with less than 40 characters and those containingLaTeXmarkup\. Fig\.[4](https://arxiv.org/html/2606.03250#S3.F4)summarizes the described steps for PMC translation as a flowchart\. The translated and natively German articles were then combined into a single dataset, resulting in a total of 90,272 documents and 1,609 MB of text data\.
Figure 2:This histogram shows the annual number of articles published in PMC from 2018\-2024, with a fitted trend line indicating the overall growth and decline of article counts\. The estimated article count for Q4 2024 was approximated based on the maximum Q4 article count observed in previous years \(50,230\)\.Figure 3:The histogram shows the quarterly number of articles published in PMC in 2020, with a peak in Q1 at 201,279 articles, followed by a drop in the subsequent quarters\.Figure 4:This flowchart illustrates the sequential steps involved in the translation process of articles from PubMed Central \(PMC\)\. It begins with extracting article details fromPostgreSQLusingPubmed Parser, followed by language\-based filtering to select English articles from Q3 and Q4 of 2020\. Articles are then filtered based on length andLaTeXcontent, followed by translation to German using the NLLB 200 1\.3B distilled model\.
#### 3\.1\.4PhD Theses
In this work, we also included a collection of 7,486 open\-access German\-language dissertations and postdoctoral dissertations from Charité University Hospital, Germany’s largest university hospital\. At the joint medical faculty of Humboldt University and Free University of Berlin, electronically published documents, doctoral and habilitation theses, as well as research data are made available to the public through the university’s institutional repositoryRefubium\[refubium\]\. The documents were downloaded in bulk as PDF files and subsequently converted to plain text\. Data cleaning involved removing sentences that lacked German stop words and excluding theses under 15 pages in length\. This process ensured the inclusion of only relevant, high\-quality text data\. In total, 646 MB of text was extracted from the PhD theses\.
#### 3\.1\.5Medical Wikipedia
Wikipedia curates entry pages on the encyclopedia about particular subject areas in so\-calledportals\. Each portal acts as a hub, bringing together key articles, images, and resources about the respective topic\. Portals are particularly useful for getting an organized overview or exploring related subtopics without searching through individual articles\. We utilized the German Wikipedia portal on medicine in order to extend our pre\-training corpus with freely available texts on medical topics, which were collectively authored and editorially proofread by a diverse community of volunteer contributors\. Wikipedia does not offer an API for bulk data retrieval, but instead provides an export interface\[wikipediaexport\]for downloading specified wiki pages in a special XML format\. These XML files follow a schema specific toMediaWiki, the software behind Wikipedia, initially intended for importing into anotherMediaWikiinstallation but also allows for further processing and analysis\.
The export interface expects either a list of page titles or a category name, which it resolves into a list of pages related to the given category\. Since our objective is to crawl the entirety of the medical portal, we implemented a breadth\-first search algorithm on Wikipedia’s export interface, employingSelenium WebDriver\[gojare2015analysis\]to traverse the category tree of the portal\. The algorithm starts at the root categoryPortal:Medizinand recursively visits each subcategory, collecting the titles of all pages contained within\. The page titles are then used to download the corresponding XML files in bulk, creating a dump of the entire German medical portal\. TheMediaWikiXML files are parsed using Python’sElementTreemodule to extract page contents\. Wikitext formatting is then removed usingMediaWiki Parser from Hell\[kurtovic\_earwigmwparserfromhell\_2025\], resulting in clean plain text documents\. The German Wikipedia portal on medicine contributes a total of 75,585 documents and 362 MB of data to the pre\-training corpus\.
#### 3\.1\.6MIMIC\-IV Notes
Medical Information Mart for Intensive Care IV \(MIMIC\-IV\)\[johnson2023mimic\]is a large and freely accessible electronic health record dataset comprising various health\-related data acquired during routine clinical care of patients admitted to critical care units of the Beth Israel Deaconess Medical Center in Boston, MA, USA\. MIMIC\-IV constitutes the fourth edition of the dataset, containing data of over a decade from 2008 to 2019 and covering a wide range of information such as patient measurements, orders, diagnoses, procedures, treatments, and clinical notes\.
For our corpus, we specifically chose to utilize theclinical notes\[johnson2023mimicnote\]subset of the MIMIC\-IV database as it is made up of discharge summaries written in the form of free text, which is well suited for training contextual language models\. The 330,485 discharge summaries from 145,915 hospitalized patients are organized into sections including chief complaint, history of present illness, past medical history, brief hospital course, physical exams, and discharge diagnoses\. These free\-text notes were acquired from the hospital system and de\-identified by the authors using a combined automatic approach of custom rules and a neural network trained on de\-identification, cast as a NER task\.
The note subset of the MIMIC\-IV dataset is available onPhysioNet\[goldberger2000physiobank\], a repository for freely accessible medical data and tools for computational medicine research\. After downloading the collection of clinical notes, we utilized LLMs to translate the English discharge summaries to German\. Specifically, we employed the multilingualLLaMA 3\.1 8B\[dubey2024llama3\]model in an API\-like manner by providing the prompt as shown in Tab\.[2](https://arxiv.org/html/2606.03250#S3.T2)\. The translated notes were then included in the pre\-training corpus, consisting of 330,485 documents and 5,310 MB of data\.
System Prompt:You are an API\-like assistant, and output only the plain response without further explain or comment the output\.User Instruction:Translate the following text strictly into German\. Do not replace the \_\_\_ pseudonymization masks\.<English Text\>Table 2:LLaMA 3\.1 system prompt and user instruction used for MIMIC\-IV translation
#### 3\.1\.7Web Crawl
To enrich our corpus with current medical content from the German web, a web crawl was performed using the implementation described in\[deng2025crawler\], which extends the open\-source crawlerApache Nutch\[khare2004nutch\]\. The crawl was seeded with a combination of domains from thetala\-med search\[specht2025evaluating\]index as well as the seed sources provided by thesampled German Health Web\(sGHW\)\[zowalla2020crawling\]project\. Tala\-med search is a specialized search engine that provides high\-quality, evidence\-based health information\. In its current version, it indexes 26 trustworthy German health websites and ensures strict user privacy\. The sGHW project represents previous efforts to index health\-related web content in the German language and employed a specialized focused crawler to create an index of 22,405 German health websites\. The sGHW index was limited to websites with\.de,\.at, and\.chtop\-level domains, and used a support vector machine to filter content for health relevance automatically\. Our crawl was configured with parametersdepth=3andtopN=100\. In web crawling,depthrefers to the number of hops or iterations the crawler will follow links from the seed URL, while the width, calledtopN, specifies the maximum number of URLs to fetch in each iteration\. These parameters control the crawling process and were chosen to allow for systematic exploration of linked content while maintaining a manageable scope\.
Despite the focused seed list, unsuitable as well as nonmedical content, such as advertisements, remained present in the crawl data due to the nature of web crawling\. To address this issue, we developed a text classifier in order to filter medical and scientific content from general web content, ensuring the relevance of the gathered data\. The classifier was built by fine\-tuning GeistBERT on a binary\-labeled dataset derived from the scientific portion of the10kGNAD\[10kGNAD\]corpus\. The 10kGNAD dataset is a subset of theOne Million Posts Corpus\[schabus2017one\]and consists of 10,000 German news articles, including 573 focused on scientific topics\. These scientific articles make up the first half of the fine\-tuning dataset, while the second half was created using a stratified sample to ensure a balanced dataset and that each category was proportionally represented\. The classifier’s performance was evaluated using a manually labeled subset of the web crawl data of size 119, which was annotated usingLabel Studio\[labelstudio\], an open\-source data labeling tool\. On this test set, the classifier achieved an F1score of 80\.34%, indicating a reliable level of accuracy\. Following this validation, we applied the classifier to filter the complete web crawl dataset\. After filtering, we removed documents from the web crawl with less than 40 characters, those containing the Unicode replacement characterU\+FFFDdue to encoding issues, and duplicates\. For the remaining documents, we removed phone numbers, email addresses, URLs and emojis utilizing theclean\-text\[clean\-text\]Python library\. The preprocessing of the web crawl resulted in a final collection of 93,642 documents and 512 MB of data\. Both the classifier\[christbertscignad\_tcls\_2024\]and the scientific subset used for training, referred to assciGNAD\[christbertscignad\_2024\], are publicly released to support reproducibility and downstream research\.
### 3\.2Pre\-Training
Leveraging the created large\-scale biomedical corpus as described in Sec\.[3\.1](https://arxiv.org/html/2606.03250#S3.SS1)as well as the architectural foundation laid by the current state\-of\-the\-art German general\-purpose language model GeistBERT, we developed biomedical adaptations by following three main strategies:
1. 1\.Continued pre\-training: Starting from the checkpoint of GeistBERT, we initialize a RoBERTa base model with the identical weights and general domain vocabulary\. Subsequently, all parameters of the model are retrained on our 13\.5 GB training data as listed in Tab\.[1](https://arxiv.org/html/2606.03250#S3.T1)\. Essentially, this approach is equivalent to extending GeistBERT’s pre\-training dataset with the new data, which is why this strategy is known ascontinuedpre\-training\. The created model following this approach will be referred to as ChristBERT\.
2. 2\.Pre\-training from scratch: We also explored the possibility of pre\-training a RoBERTa model from scratch using the same architecture and vocabulary as GeistBERT, but without any initialization from the general domain model\. As a result, this model solely learns language representations from our biomedical corpus\. We denote this model as ChristBERTscratch\.
3. 3\.Vocabulary adaptation: In order to study the impact of a domain\-specific vocabulary, this strategy involves the creation of a new vocabulary based on the created biomedical corpus and follows the same pre\-training process as ChristBERTscratch\. The vocabulary is generated analogously to GeistBERT, using a GPT\-2 style byte pair encoding \(BPE\) tokenizer with a target vocabulary size of 52,000 tokens\. The resulting model is referred to as ChristBERTBPE\.
Each of the three models was pre\-trained using thefairseq\[ott2019fairseq\]framework on the domain\-specific corpus presented in Sec\.[3\.1](https://arxiv.org/html/2606.03250#S3.SS1), which amounts to 13\.5 GB of uncompressed text data\. The documents comprising the training data, as listed in Tab\.[1](https://arxiv.org/html/2606.03250#S3.T1)were shuffled in order to improve pre\-training robustness\. The models underwent training for 100,000 update steps with a batch size of 8,192, utilizing weight initialization based on one of the three previously outlined strategies\. We adapted GeistBERT’s pre\-training configuration, which closely aligns with RoBERTa’s standard training setup\[liu2019roberta\], encompassing dynamic masking for the WWM learning objective,AdamWoptimizer parameters, and a fixed sequence length of 512 tokens\. To comply with the maximum input sequence length of the model, full sentences from multiple documents in the pre\-training corpus were packed into text segments\. This procedure allows for retention of natural sentence structure despite the use of fixed\-length sequences\. For efficient data access, thefairseqlibrary converts the input data into a binary format and utilizes memory\-mapped file I/O\. A warmup phase of 10,000 iterations was implemented, gradually increasing the learning rate to a maximum of7×10−47\\text\{\\times\}\{10\}^\{\-4\}for ChristBERT and6×10−46\\text\{\\times\}\{10\}^\{\-4\}for ChristBERTscratchand ChristBERTBPE, followed by a polynomial decay to zero\. The complete pre\-training procedure was performed on clusters equipped with either four Nvidia A100 interconnected via SXM or two Nvidia H100 GPUs\. The cumulative training time for the three models amounted to approximately 21\.7 days \(refer to Tab\. S2 in the Supplementary Material\)\.
### 3\.3Language Modeling Evaluation
To assess the impact of different pre\-training strategies, we evaluate the intrinsic language modeling performance of our models usingperplexity\[bengio2003neural\]\. Perplexity is a widely used metric that quantifies how well a language model predicts a sequence of words; lower values indicate better generalization and more confident predictions of unseen text\. The perplexity \(commonly abbreviated as ppl\) of a modelθ\\thetaon a test set𝒲\\mathcal\{W\}is defined as the inverse probability thatθ\\thetaassigns to𝒲\\mathcal\{W\}, normalized by the test set length\. More formally, for a sequence ofnnwordsw1:n=\(w1,…,wn\)w\_\{1:n\}=\(w\_\{1\},\\ldots,w\_\{n\}\), the perplexity is given by:
pplθ\(w1:n\)\\displaystyle\\text\{ppl\}\_\{\\theta\}\(w\_\{1:n\}\)=Pr\(w1:n\)−1nθ\\displaystyle=\\Pr\{\}\_\{\\theta\}\(w\_\{1:n\}\)^\{\-\\frac\{1\}\{n\}\}\(1\)=1Pr\(w1:n\)θn\\displaystyle=\\sqrt\[n\]\{\\frac\{1\}\{\\Pr\{\}\_\{\\theta\}\(w\_\{1:n\}\)\}\}\(2\)
We can use the chain rule of probability to express the perplexity of a sequence of words as the product of the probabilities of each word given its preceding words:
pplθ\(w1:n\)=∏i=1n1Pr\(wi\|w1:i−1\)θn\\text\{ppl\}\_\{\\theta\}\(w\_\{1:n\}\)=\\sqrt\[n\]\{\\prod\_\{i=1\}^\{n\}\\frac\{1\}\{\\Pr\{\}\_\{\\theta\}\(w\_\{i\}\|w\_\{1:i\-1\}\)\}\}\(3\)
Note that due to the inverse relationship in Eq\.[1](https://arxiv.org/html/2606.03250#S3.E1), higher probabilities assigned to word sequences correspond to lower perplexity values\. Consequently, a model with lower perplexity indicates that it is a better predictor of the given test set\. Minimizing perplexity is equivalent to maximizing the probability of the test set as predicted by the LM\.
### 3\.4Downstream Task Evaluation
With minimal adjustments to its architecture, pre\-trained LMs can be adapted for downstream applications by fine\-tuning them on task\-specific datasets\. Fine\-tuning involves adding task\-specific layers or adaptation heads that process the model’s hidden representations\. The fine\-tuning process consists of continued training using labeled data from supervised datasets to adjust the weights of both the pre\-trained model and the task\-specific layers added on top\.
To demonstrate the efficacy of our domain\-adapted model in biomedical language modeling, we fine\-tune and evaluate the ChristBERT models on two common biomedical downstream tasks: Named entity recognition \(NER\) and text classification\.
#### 3\.4\.1Named Entity Recognition
NER is used to extract relevant text spans, such as mentions of diagnoses or medications, from clinical text\. We evaluated NER performance on three German biomedical corpora covering oncology and cardiology domains\.BRONCO150\[kittner2021bronco150\]consists of anonymized sentences from 150 discharge summaries labeled with the categoriesMedication,Treatment, andDiagnosis\.GGPONC\[borchert2022ggponc\]is a large\-scale corpus derived from German oncology guidelines, containing over 200,000 named entities\. There are two major versions of the corpus, from which we used the second version\. For our experiments, we selected the most challenging configuration with fine\-grained labels and long entity spans to ensure comparability with prior work\[bressem2024medbert\]\. Finally,CARDIO:DE\[cardiode\]comprises 500 cardiovascular discharge letters annotated with six medication\-related entity types\. We excluded experimental sublabels from CARDIO:DE due to their low inter\-annotator agreement\.
#### 3\.4\.2Text Classification
Text classification refers to assigning one or more labels to a document based on its content\. We evaluated model performance on two multi\-label German biomedical classification tasks:
CLEF eHealth 2019\[crestani2019experimental,clef2019nts,clef2019test\]consists of 8,385 German non\-technical summaries \(NTS\) of planned animal studies from the AnimalTestInfo database\[animaltestinfo\]\. Each summary is annotated with zero or more ICD\-10 codes\. We followed prior work\[lentzen2022critical\]in filtering out rare classes \(fewer than 25 occurrences\), resulting in 5,688 documents and 230 classes\.JSynCC\[lohr2018sharing\]contains 867 synthetically generated German case reports from 10 medical textbooks, each annotated with one or more medical specialties\. To address class imbalance, we again retained only frequently occurring labels, reducing the dataset to 534 documents and 6 classes, includingTrauma Surgery,Anesthesiology, andOrthopedics\.
Both datasets exhibit significant label imbalance and are treated as multi\-label classification problems\.
#### 3\.4\.3Evaluation Metrics
We report standard classification metrics: precision, recall, and F1score\. Following common practice in biomedical NER and multi\-label classification\[tjong2003introduction,harbecke2022only\], we used micro\-averaged F1as our primary metric to account for class imbalance and capture overall model performance\.
#### 3\.4\.4Dataset Preparation
For all NER benchmarks, we employed theBigBIO\[fries2022bigbio\]library, which provides harmonized dataset schemas, standardized IOB2 entity annotations, and consistent data access tooling for biomedical NLP\. For text classification tasks, we used theHuggingface Datasetslibrary\[lhoest2021datasets\], which also serves as a foundation for BigBIO\. Whenever available, we preserved the official training, validation, and test splits\. For datasets without predefined splits, namely BRONCO150, CARDIO:DE, and JSynCC, we applied stratified random partitioning, allocating 80%, 10%, and 10% of the data to training, validation, and testing, respectively\. Figure[5](https://arxiv.org/html/2606.03250#S3.F5)illustrates the label distribution across splits for each benchmark\. We exclude the CLEF eHealth 2019 dataset from this figure, as it contains 230 possible classes\.
\(a\)BRONCO150
\(b\)CARDIO:DE
\(c\)GGPONC withfine\-grained entity classes andlongannotation spans
\(d\)JSynCC
Figure 5:Entity and class distributions of the downstream tasks
#### 3\.4\.5Experimental Setup
To evaluate downstream performance, we conducted fine\-tuning experiments with hyperparameter optimization on each task\. Specifically, we performed a grid search over batch size and learning rate, as detailed in Table[3](https://arxiv.org/html/2606.03250#S3.T3), yielding 28 trials per task\. The search space is based on the GeistBERT evaluation setup\[scheibleschmitt2025geistbertbreathinglifegerman\]and extended with additional learning rate values\. Each trial used a warmup step ratio of 10% and trained for up to 30 epochs\. The best model checkpoint was selected based on validation set performance\. For both NER and classification tasks, we reportmicro\-averagedprecision, recall, and F1scores on each benchmark’s test set\.
Table 3:Hyperparameters used in the grid search for the downstream tasksUnlike perplexity \(see Sec\.[3\.3](https://arxiv.org/html/2606.03250#S3.SS3)\), which evaluates intrinsic language modeling ability, downstream task performance enables direct comparison of model efficacy across architectures and domains\. To benchmark the ChristBERT models, we selected four state\-of\-the\-art \(SOTA\) German Transformer\-based language models—two domain\-specific and two general\-purpose baselines \(Table[4](https://arxiv.org/html/2606.03250#S3.T4)\)\. All model\-task combinations underwent identical hyperparameter tuning and evaluation procedures for fair comparison\. A brief overview of each baseline is provided below; additional architectural details are listed in Table S1 in the Supplementary Material\.
Table 4:Architecture, domain, and corpus size of evaluated models\. For ChristBERT, BioGottBERT and GeistBERT, corpus size indicates the size of the initial \+ continuous pre\-training corpus\.##### medBERT\.de
is based on the BERT\[devlin2019bert\]base architecture and is specialized for the German medical domain\. Similar to ChristBERTBPE, it was trained from scratch with a custom domain\-specific vocabulary on a large and diverse 10\.3 GB corpus, comprising 4\.7 million German medical documents from eleven different sources, including articles from the German health web, scientific texts, medical books, and real\-world clinical data such as electronic health records and radiology reports from Charité University Hospital\. This substantial dataset translated into SOTA performance on various medical benchmarks, particularly for longer and more complex texts, such as NER and ICD\-10 chapter classification from radiology discharge summaries and surgical reports\.
##### BioGottBERT
is a domain\-adapted variant of the unfiltered base version of GottBERT\[scheible2020gottbert\], a RoBERTa\-based model trained on the German portion of the OSCAR corpus\[suarez2019asynchronous\]\(145 GB of general text\)\. Similar to ChristBERT, BioGottBERT wascontinuously pre\-trainedon 809 MB of biomedical German texts, specifically from Wikipedia, scientific abstracts and drug leaflets\. Despite the small biomedical corpus, BioGottBERT demonstrated notable improvements over its general\-domain counterpart on a variety of medical NLP tasks, including NER and classification problems\. Our benchmark selection closely follows that of BioGottBERT\. The shared tokenizer enables direct comparability between ChristBERT and BioGottBERT across all tasks\.
##### GeistBERT
is a general\-domain German language model based on the RoBERTa base architecture\. It follows acontinued pre\-trainingapproach, initializing from the best checkpoint of filtered GottBERT\[scheible2020gottbert\]\(94,530 steps\), and extending it with 100,000 further training steps using WWM\. The training corpus spans 1\.3 TB of partially deduplicated German data, including crawled web text and publicly accessible legal documents\[nguyen2023culturax,tiedemann2012parallel\]\. GeistBERT achieves SOTA results across NER, classification, and natural language inference tasks, outperforming even larger models\. GeistBERT serves as the general\-domain reference for assessing the impact of domain\-specific pre\-training\.
##### GeBERTa
is another general\-domain model that employs the DeBERTa\[he2021deberta\]base architecture, featuringdisentangled attentionfor improved contextual representation\. It was trained on 167 GB of heterogeneous German data, including formal, informal, legal, medical, and literary text\. GeBERTa has been evaluated on general and medical NER, sentiment analysis, hate speech detection, and question answering tasks\. As a non\-RoBERTa baseline, it allows comparison of architectural effects and cross\-domain training data on downstream performance\.
#### 3\.4\.6Implementation Details
All fine\-tuning experiments were conducted using theHuggingface Transformers\[wolf2020transformers\]Python library and theNeural Network Intelligence\[nni\]framework for hyperparameter tuning\. The choice of libraries was reinforced by their native support for the dataset implementations\[fries2022bigbio,lhoest2021datasets\]\. For classification, model inputs were tokenized with truncation to, and padding up to the maximum length of 512 tokens\. In the case of NER, longer sequences were split into one or more sequences of 64 tokens except for GGPONC, which was split into 128 tokens\. These values correspond to the 95\-th percentile of the sequence lengths across each dataset’s training, validation and test splits\. The evaluation metrics were computed using theseqeval\[seqeval\]Python library for NER and thesklearn\[pedregosa2011scikit\]Python library for classification\. Inseqeval,strictevaluation mode was applied, measuring both the correctness of the entity boundary and the entity class\. All experiments were conducted on consumer\-grade hardware, specifically an NVIDIA RTX 3090 GPU with 24 GB VRAM, ensuring reasonable training times for practitioners\. The total computation time for all experiments encompassing all 35 model\-dataset combinations, amounted to approximately 6\.74 days \(refer to Tab\. S3 in the Supplementary Material\)\.
## 4Results
### 4\.1Pre\-Training Performance
Fig\.[6](https://arxiv.org/html/2606.03250#S4.F6)plots the perplexity of the three ChristBERT models during pre\-training over 100,000 training steps\. Perplexity was evaluated on a held\-out validation set of 3,000 randomly chosen documents from our pre\-training corpus listed in Tab\.[1](https://arxiv.org/html/2606.03250#S3.T1)\. We observe that the different domain adaptation strategies are reflected in the perplexity trajectories in terms of initial perplexity, rate of decline and convergence behavior\.
Figure 6:Perplexity during pre\-training of ChristBERT models\. Perplexity is shown in log scale for every optimization step and evaluated on the validation split of the pre\-training corpus\.#### 4\.1\.1Initial Perplexity and Rate of Decline
Initially, the two ChristBERT variants pre\-trained from scratch exhibit high perplexity values of57434\.557434\.5and56343\.356343\.3, while the continuously pre\-trained ChristBERT starts with a lower perplexity of12\.6412\.64\. The lower perplexity directly results from the model being initialized with the weights of GeistBERT, demonstrating the effectiveness of transfer learning\. As pre\-training progresses, perplexity decreases steeply during the first 10,000 steps for ChristBERTscratchand ChristBERTBPE, with the rate of perplexity reduction following a non\-linear pattern across all variants\. The steepest reduction occurs within the first 5,000 steps, where we observe perplexity values dropping approximately two orders of magnitude from∼103\{\\sim\}10^\{3\}to∼101\{\\sim\}10^\{1\}\. The observed perplexity curve can be attributed to the learning rate schedule, in which the learning rate is linearly increased for 10,000 iterations to its maximum\. After this warmup phase, the perplexity trajectory flattens considerably between steps 10,000 and 40,000\. Model perplexity continues to decrease but at a substantially slower rate, which is due to the learning rate following a polynomial decay to zero after the first 10,000 steps\.
#### 4\.1\.2Convergence Behavior
The continuously pre\-trained ChristBERT model converges the fastest, stabilizing at around a perplexity of 3\-4 by 10,000 steps and maintaining the lowest perplexity throughout pre\-training\. Additionally, ChristBERT consistently achieved lower perplexity than both BPE and Scratch variants, with perplexity values 30\-50% lower during the middle stages of pre\-training\. Despite this, after around 40,000\-50,000 iterations, all models reach a relatively stable perplexity level between 2\-4\. Diminishing returns are observed after 60,000 steps, suggesting extended pre\-training offers minimal improvements and convergence is achieved\. Moreover, we observed divergence in some pre\-training runs, particularly due to high learning rates where model parameters were updated too aggressively\. This divergence manifested as sudden spikes in perplexity and subsequent failure to converge is shown in Fig\. S1 in the Supplementary Material\. To mitigate divergence, we found that lowering the learning rate was effective in stabilizing pre\-training\. This was necessary for ChristBERTscratchand ChristBERTBPE, where we reduced the maximum learning rate from7×10−47\\text\{\\times\}\{10\}^\{\-4\}to6×10−46\\text\{\\times\}\{10\}^\{\-4\}, while with the GeistBERT initialization it was possible to use a higher peak learning rate\.
ChristBERTBPEconsistently demonstrates the highest perplexity values among the three variants, particularly during the middle phase of pre\-training\. This effect likely stems from its custom byte\-pair encoding vocabulary, where it spends the middle stages learning different language representations\. As discussed in Sec\.[3\.3](https://arxiv.org/html/2606.03250#S3.SS3), perplexity comparisons are most meaningful between models sharing the same tokenizer; therefore this difference does not necessarily indicate inferior model quality\. Moreover, an improvement in an intrinsic measure such as perplexity does not necessarily correlate with enhanced performance in extrinsic measures such as practical downstream language tasks\. Nevertheless, perplexity remains a useful proxy for estimating a model’s generalization capacity and its potential effectiveness on downstream tasks\.
### 4\.2Fine\-Tuning Performance
#### 4\.2\.1Named Entity Recognition
Tab\.[5](https://arxiv.org/html/2606.03250#S4.T5)shows the performance results of medical named entity recognition on the BRONCO150, CARDIO:DE and GGPONC datasets\. Detailed results for each entity type in the respective dataset are reported in Tab\. S7\-S9 in the Supplementary Material\. The ChristBERT models consistently outperform the baseline models across all datasets, establishing a new state\-of\-the\-art German biomedical NER\.
Table 5:Overview of micro averaged precision \(Prec\.\), recall \(Rec\.\) and F1scores on the NER tasks\. All results are shown in percent and assess each model’s best fine\-tuned performance on each downstream task’s test set\. The best model was selected out of 28 runs based on its validation set performance\. Best score in bold and second best underlined\.On the BRONCO150 dataset, ChristBERTBPEachieves the highest precision \(85\.71%\), recall \(82\.32%\) and F1score \(84\.74%\), forming a substantial improvement over both specialized medical models and general language models\. ChristBERTscratchplaces second with an F1of 83\.33%, followed closely by ChristBERT with 81\.87%\. The performance delta between ChristBERT variants and other models is particularly evident when comparing against the general language models\. For instance, the F1score of ChristBERTBPEwith 84\.74% represents a 5\.62 percentage point improvement over GeistBERT \(77\.69%\) and a 5\.08 percentage point improvement over GeBERTa \(79\.12%\)\. This significant performance gap underscores the value of domain\-specific pre\-training for NER in medical texts\.
The performance on the CARDIO:DE dataset presents a different pattern of results\. Here, all compared models showed more similar performances, with ChristBERTBPEperforming the best, followed closely by GeBERTa, which leads among the baselines and demonstrates on par NER efficacy\. Both mentioned models achieve high F1scores of 90\.40% and 90\.37%, respectively, differing in precision and recall\. This dataset highlights the potential for general language models to perform competitively in certain medical subdomains when trained appropriately\.
The GGPONC dataset presents the most challenging evaluation scenario with eight fine\-grained semantic classes and long entity spans across a large corpus of oncology documentation\. On this complex dataset, ChristBERT models again demonstrate superior performance compared to the baseline models, with ChristBERT achieving the highest recall \(79\.83%\), while ChristBERTBPEattains the highest precision \(76\.59%\)\. Here, ChristBERTBPEand ChristBERTscratchmatch each other’s precision, recall and F1scores\. The performance advantage of our pre\-training corpus on GGPONC is particularly noteworthy given the complexity of this dataset\. With an F1of 77\.69%, ChristBERT is the best performing NER model and outperforms the next best non\-ChristBERT model GeBERTa at 76\.45% by 1\.24 percentage points\. The demonstrated advantage in the most complex dataset suggests that the domain\-specific pre\-training of ChristBERT models enables more effective learning of the nuanced entity boundaries and semantic distinctions required for fine\-grained medical entity recognition\.
#### 4\.2\.2Text Classification
Tab\.[6](https://arxiv.org/html/2606.03250#S4.T6)presents the classification results for each model on the CLEF and JSynCC classification datasets\. Detailed results for each topic category in JSynCC can be found in Tab\. S9 in the Supplementary Material\. We omit a separate per class drill\-down for the CLEF dataset as it contains over 230 classes\. As such, the CLEF benchmark poses the more challenging multi\-label classification task, while JSynCC only requires assigning labels out of six medical categories\.
Table 6:Overview of micro averaged precision \(Prec\.\), recall \(Rec\.\) and F1scores on the classification tasks\. All results are shown in percent and assess each model’s best fine\-tuned performance on each downstream task’s test set\. The best model was selected out of 28 runs based on its validation set performance\. Best score in bold and second best underlined\.On the CLEF dataset, GeBERTa achieves the highest F1score at 89\.31%, driven by its superior recall at 89\.71%\. Nonetheless, ChristBERTscratchdemonstrates the highest precision \(93\.68%\), indicating that it is more effective at minimizing false positives\. However, its recall \(85\.17%\) is lower than GeBERTa’s, resulting in a lower overall F1at 89\.22%\. To our surprise, we observe that both general and domain\-specific models perform similarly on this dataset\. Notably, the continuously pre\-trained ChristBERT variant shows the lowest overall performance among the evaluated models on this dataset with an F1of 76\.03%\. Its performance differs by 4\.61 percentage points from the next best model GeistBERT \(80\.64%\), its general domain counterpart\. This suggests that the continuous pre\-training approach may not be as effective for complex multi\-label classification problems, particularly when compared to the other ChristBERT variants, which were pre\-trained with the same corpus but with different initialization strategies\.
It should be noted that GeBERTa included CLEF data in its pre\-training corpus, meaning it had already seen this data before evaluation\. This might explain its exceptionally high performance compared to other models and should be considered when interpreting these results\. Even so, medBERT\.de achieves strong performance with an F1score of 88\.40%, demonstrating that domain adaptation across different medical subdomains supports the processing of specialized terminology and concepts in animal experiment documentation\.
On the JSynCC dataset, the majority of ChristBERT models considerably outperform the baseline models, with ChristBERTscratchachieving the highest F1score of 94\.61%, closely followed by ChristBERT at 94\.19% and a shared third place between GeistBERT and GeBERTa at 92\.59%\. A particularly striking observation is the perfect recall \(100%\) of ChristBERT on the JSynCC dataset, indicating that it identifies all relevant specialty classifications across the test documents\. However, its precision \(89\.01%\) is lower than other models, resulting in an F1of 94\.19%\. This pattern suggests that ChristBERT may be over\-predicting certain class labels, but its comprehensive coverage ensures no relevant classifications are missed, a characteristic that could be valuable in clinical applications where missing a relevant specialty category might have significant consequences\.
The performance clustering on JSynCC is notably tight, with all models achieving F1scores between 92\.59% and 94\.61%\. Notably, BioGottBERT achieves the second\-highest overall performance on JSynCC with an F1of 93\.57% and recall of 98\.77%\. This suggests that the synthetic nature of this corpus may present more standardized linguistic patterns that various model architectures can effectively learn during fine\-tuning\. Furthermore, while ChristBERTBPEhas consistently shown the best performance in NER tasks, it does not rank among the top models on all classification benchmarks\. This indicates that the BPE vocabulary may not be as effective for text classification tasks, where the model’s ability to generalize across different contexts and semantic meanings is crucial\.
#### 4\.2\.3Cross\-Model Analysis and Domain Specialization Effects
Among the ChristBERT variants, ChristBERTBPEconsistently demonstrates strong performance across all NER datasets, achieving the highest or second\-highest F1scores in each experiment\. This suggests that the custom BPE vocabulary approach may offer advantages for handling the morphological complexity and specialized vocabulary found in German medical texts\. Despite its seemingly weaker performance during pre\-training as indicated by higher perplexity values, its downstream performance confirms that pre\-training metrics do not necessarily translate into task\-specific effectiveness\.
ChristBERTscratchalso performs competitively across NER datasets, indicating that domain\-specific training from initialization can be effective without leveraging transfer learning from general domain pre\-training\. The continuously pre\-trained ChristBERT model shows particular strength in the GGPONC dataset, suggesting it may have advantages for handling complex, fine\-grained entity recognition tasks\.
The comparison between specialized medical models \(ChristBERT variants, medBERT\.de, BioGottBERT\) and general language models \(GeistBERT, GeBERTa\) reveals distinct performance behavior in medical NER\. In BRONCO150 and GGPONC, domain\-specific models generally outperform general models, confirming the value of specialized pre\-training for oncology text\. However, in CARDIO:DE, GeBERTa achieves the highest F1, suggesting that general language models can be competitive in certain medical subdomains when trained on heterogeneous and cross\-domain data\. Notably, 8% of GeBERTa’s pre\-training data consisted of medical texts\.
This variability illustrates that domain specificity presents different advantages depending on the particular medical subdomain and entity types being targeted\. The general language models appear more competitive on CARDIO:DE, possibly due to differences in writing style, terminology standardization, or entity class definitions between cardiovascular and oncology domains\. Interestingly, we observe GeistBERT exhibiting equivalent performance to the domain\-adapted model BioGottBERT\. We attribute this mainly to the relatively small size of BioGottBERT’s biomedical training corpus \(0\.8 GB\), highlighting the importance of corpus size in achieving effective domain adaptation\.
An analysis of precision and recall values reveals different optimization patterns across models\. ChristBERTBPEtends to favor precision over recall in BRONCO150 and GGPONC, while achieving high values in both metrics for CARDIO:DE\. In contrast, the continuously pre\-trained ChristBERT shows stronger recall performance, particularly in GGPONC\. These trade\-offs have important implications for clinical applications, where the relative importance of precision versus recall may vary based on the specific use case\.
For the classification tasks, a complementary pattern emerges\. While ChristBERTBPEdominated in NER, it was outperformed by ChristBERTscratchand the baseline GeBERTa on both CLEF and JSynCC\. This suggests that the advantages of byte\-pair encoding may not generalize equally across all task types\. In contrast, ChristBERTscratchdelivered consistently strong results in both precision and recall, particularly excelling on JSynCC, which implies that full pre\-training on domain\-specific corpora enables robust feature representations for document\-level tasks\.
The continuously pre\-trained ChristBERT variant showed the weakest classification performance, likely due to residual biases from general\-domain pre\-training interfering with adaptation to complex, multi\-label classification setups like CLEF\. Interestingly, despite its poor performance on CLEF, this variant achieved perfect recall on JSynCC, underscoring that continued pre\-training can support comprehensive label coverage but may lead to over\-prediction and reduced precision\.
## 5Discussion
### 5\.1General Findings
In this study, we systematically explored three complementary strategies for domain adaptation of German biomedical language models: continued pre\-training from a general\-domain model \(ChristBERT\), pre\-training from scratch \(ChristBERTscratch\), and vocabulary adaptation via domain\-specific subword tokenization \(ChristBERTBPE\)\. All models were pre\-trained on a newly curated 13\.5 GB biomedical corpus and evaluated on downstream biomedical tasks, including NER and text classification\. Our experiments reveal several principal findings regarding the three investigated domain adaptation strategies\.
First, continued pre\-training proved particularly effective in terms of efficiency\. ChristBERT achieved the lowest perplexity and converged fastest, underscoring the benefits of leveraging general\-domain knowledge\. This advantage, however, did not always translate into downstream superiority for initializing biomedical models\. Instead, its performance varied among downstream tasks: While ChristBERT excelled in NER, particularly on GGPONC, it ranked lowest on complex classification tasks such as CLEF, indicating that inherited general\-domain priors may not always be beneficial for complex classification tasks\.
Second, pre\-training from scratch led to robust and often superior downstream performance\. ChristBERTscratchachieved top results on text classification tasks, particularly JSynCC, where it attained the highest F1score\. This suggests that domain\-exclusive representations learned from scratch may offer advantages in classification scenarios requiring broader semantic coverage and contextual generalization\.
Third, domain\-specific vocabulary adaptation \(ChristBERTBPE\) yielded the strongest performance for entity\-centric tasks\. Despite higher perplexity during pre\-training, this variant excelled in NER tasks across all datasets, achieving state\-of\-the\-art results on BRONCO150 and CARDIO:DE\. However, its performance in classification tasks was less competitive, indicating that the benefits of domain\-optimized tokenization are most pronounced in tasks sensitive to terminological precision and morphological complexity\.
Finally, comparisons to general\-purpose language models highlighted the importance of domain adaptation\. While general models such as GeistBERT and GeBERTa remained competitive on certain datasets like CARDIO:DE and CLEF, they were consistently outperformed by the ChristBERT variants on more specialized or complex biomedical tasks\. Furthermore, smaller\-scale domain adaptation efforts \(e\.g\., BioGottBERT\) could not match the performance gains achieved through our larger corpus and comprehensive pre\-training strategies\.
In summary, our findings emphasize that no single adaptation strategy universally outperforms the others\. Continued pre\-training offers rapid convergence and strong generalization\. From\-scratch pre\-training provides robust performance for classification, while additional domain\-specific vocabulary is most beneficial for specialized tasks like NER\. Our results highlight that the suitability of domain\-specific tokenization strategies, such as a custom BPE vocabulary, is highly task\-dependent\. This suggests that domain\-specific BPE tokenization is especially beneficial for entity recognition, where accurate boundary detection and handling of rare terms are critical\. In contrast, classification tasks often rely more on the model’s ability to generalize over broader semantic and syntactic patterns rather than fine\-grained tokenization\. Thus, in such contexts, the rigid subword splits introduced by domain\-specific BPE may offer less benefit, or even introduce unnecessary complexity\. These observations emphasize the importance of aligning vocabulary adaptation strategies not only with the domain but also with the linguistic properties and demands of the target task\.
### 5\.2Findings in the Context of Prior Work
Our findings align well with and extend previous work on domain\-adaptive pre\-training\. The study\[gururangan2020don\]demonstrated that continued pre\-training yields significant gains for domain\-specific tasks, especially when the target domain is distant from the original pre\-training corpus\. Our results confirm this for German biomedical NLP: continued pre\-training \(ChristBERT\) led to rapid convergence and strong performance in complex NER tasks like GGPONC\. Furthermore, previous findings from\[el2022re\]suggest that training from scratch can be competitive with, or even outperform, continued pre\-training on biomedical classification tasks\. Our ChristBERTscratchmodel demonstrated this by excelling on both the JSynCC and CLEF classification benchmarks\. In their experiments, the authors of\[el2022re\]also observed that medical\-specific vocabularies lead to performance gains in downstream domain tasks\. Our results mirror these prior observations, with ChristBERTBPEachieving top results in NER, reinforcing the idea that domain\-aligned vocabulary improves handling of specialized terminology\.
Inspired by\[edunov2018understanding\], we translated English medical texts into German to address the scarcity of native\-language biomedical corpora\. This strategy proved effective in terms of downstream task performance compared to medBERT\.de\[bressem2024medbert\], which relied exclusively on original German data\. GeBERTa\[dada2023impact\], which also leveraged translated medical texts, achieved similarly strong results, particularly in classification scenarios\. Notably, even general\-purpose models performed competitively on classification tasks, underscoring that large\-scale general\-domain pre\-training enriched with some biomedical content remains a viable approach for such tasks\. Nevertheless, our findings support the approach of translation\-based corpus construction, especially for tasks like biomedical NER, where domain\-specific nuances and terminology require targeted representation learning and original German resources remain limited\.
While implementing the translation strategy for MIMIC\-IV with LLaMA 3\.1 and Pubmed Central with NLLB 200, similar to\[dada2023impact\]we also observed that the quality of the machine\-translated data was sensitive to translation settings\. In particular with NLLB 200, we noticed that larger context sizes and sequence lengths frequently resulted in degraded translation quality\. Phenomena such as stuttering and incoherent phrase repetition became evident, especially in complex biomedical sentences\. This degradation can stem from several factors inherent to current translation models and LLMs used for translation\. For instance, generic LLMs, if configured incorrectly during inference \(e\.g\. insufficient context window sizes\), may fail to attend to the entire input sequence, effectivelyforgettingearlier parts of the input and producing incomplete or nonsensical translations\. Likewise, many dedicated translation models are trained on sentences\. Consequently, their ability to handle longer sequences degrades, as the positional embeddings beyond the trained length are less reliable, leading to instability and errors in translation\[costa2022no\]\. To ensure the reliability of the translated corpus, we therefore opted for a context size of 384 tokens for NLLB 200, which offered a favorable balance between translation throughput and linguistic accuracy, mitigating some of these input length\-related issues\.
### 5\.3Limitations and Future Work
While this study provides valuable insights into domain adaptation strategies for German biomedical language models, several limitations remain and point to promising directions for future research\.
Our investigation was limited to the RoBERTa architecture, following the design path of GeistBERT\[scheibleschmitt2025geistbertbreathinglifegerman\]and GottBERT\[scheible2020gottbert\]\. Although this ensured comparability, alternative Transformer architectures, including the recently introduced ModernBERT\[warner2024smarter\], may offer performance advantages in terms of computational efficiency and input size\. RoBERTa’s maximum input size limitation becomes particularly constraining in biomedical contexts, where clinical documents such as patient records or scientific articles often involve extended contexts\. Future work should explore long\-context Transformers\. Architectures such as Longformer\[beltagy2020longformer\]and Nyströmformer\[xiong2021nystromformer\]could offer significant advantages in tasks requiring document\-level understanding or the resolution of complex cross\-sentence dependencies\[shalumov2023herorobertalongformerhebrew\]\.
Furthermore, our findings indicate that training from scratch can be most effective under certain conditions\. This discrepancy highlights the need for further investigation into the factors that influence the effectiveness of training from scratch versus continued pre\-training\. Future work should focus on exploring the specific scenarios and downstream tasks where training from scratch might outperform continued pre\-training\. It would be valuable to conduct a more detailed analysis of the trade\-offs between computational resources, training time, and performance gains\. Understanding these dynamics could provide insights into optimizing model training strategies for various applications\. Similarly, while our models build directly on GeistBERT and GottBERT in terms of tokenizer design and vocabulary size, these decisions were not revisited for the biomedical domain\. Given the distinct lexical properties of medical language, alternative vocabulary sizes or tokenization schemes might further optimize model performance\.
Moreover, the range of biomedical benchmarks used, while diverse, does not fully reflect the variety of clinical language processing needs\. Tasks involving decision support, complex narratives, and clinical reasoning were underrepresented, which will hopefully change in the near future with the release of the German Medical Text Corpus Project\[meineke2023announcement\]\. Addressing these gaps is important, especially in light of models like medBERT\.de\[bressem2024medbert\], which explicitly targeted such scenarios\. In addition, subdomains such as radiology, psychiatry, and primary care were not systematically explored, limiting our conclusions about generalizability\.
Our corpus design also presents limitations\. While our approach exclusively used biomedical data, models like GeBERTa\[dada2023impact\]have demonstrated that mixed\-domain corpora can enhance generalization, particularly for tasks that bridge specialized and general language\. Investigating mixed corpus strategies within the same RoBERTa architecture could therefore provide deeper insights into optimal corpus design for domain\-adaptive pre\-training\. Further considerations result from performing translation to augment the pre\-training corpus\. Although translation enabled the creation of a large biomedical corpus, the quality of this synthetic data was not manually verified by healthcare professionals and, as discussed, was sensitive to context size, with larger sizes impairing coherence\. Future work could investigate the effects of translation quality and translation model behavior itself to assess and mitigate such artifacts more systematically\. Likewise, we did not systematically analyze the individual contributions of the different data sources within our corpus; it remains unclear to what extent the translated data specifically improved performance compared to relying solely on the original German sources\. Additionally, de\-identified datasets, i\.e\. MIMIC\-IV, contain artifacts such as anonymization masks, which are not typically found in natural prose, potentially affecting performance on other types of text\. As a byproduct of the translation effort, we have obtained a large bilingual corpus of clinical texts \(MIMIC\-IV Notes\) and biomedical literature \(PubMed Central\)\. This corpus could be used to fine\-tune German–English translation models for the biomedical domain, supporting both direct clinical applications and future corpus creation\.
Lastly, one should be cautious when interpreting results in cases of potential data leakage\. In the case of GeBERTa, the pre\-training corpus included CLEF data \(without labels\), which may still confer an advantage in classification tasks involving this benchmark\. In contrast, medBERT\.de was pre\-trained on GGPONC, which is also part of our NER evaluation\. However, medBERT\.de performed worse than several models without similar data leakage, suggesting that this exposure did not translate into a measurable advantage in case of NER\. This underlines the importance of careful dataset curation and transparency when reporting benchmark results, while also showing that plain\-text overlap alone does not guarantee performance gains\. It remains unclear whether, and to what extent, such data leakage impact downstream performance, especially since both CLEF and GGPONC are relatively small compared to the full pre\-training corpora\. Determining their exact influence would require dedicated experiments, but we highlight the potential of such effects for further investigation\.
## 6Conclusion
This study systematically explored domain adaptation strategies for German biomedical language models: continued pre\-training from a general\-domain model, training from scratch on biomedical data, and adapting the tokenizer with domain\-specific BPE\. Central to this effort was the creation of a large\-scale pre\-training corpus, enriched through translation\-based data augmentation to address the scarcity of German clinical text\.
Three models were trained using these strategies and benchmarked against existing general and medical German models\. Evaluations included intrinsic perplexity and extrinsic performance across five NER and classification datasets\. The ChristBERT models achieved state\-of\-the\-art results in 4 of 5 tasks of our setup, though no single strategy consistently outperformed the others\. Continued pre\-training proved efficient and strong on certain NER tasks; training from scratch excelled in classification; BPE adaptation offered nuanced gains, particularly for specialized terminology\. Based on our evaluations, the optimal adaptation strategy depends on task requirements and resource constraints\.
This work contributes state\-of\-the\-art German biomedical language models and provides valuable insights into domain adaptation strategies, paving the way for future advancements in clinical text processing and mining\. All models including some resources are publicly released to support continued research and application\.
The authors gratefully acknowledge the scientific support and resources of the AI service infrastructure LRZ AI Systems provided by the Leibniz Supercomputing Centre \(LRZ\) of the Bavarian Academy of Sciences and Humanities \(BAdW\), funded by Bayerisches Staatsministerium für Wissenschaft und Kunst \(StMWK\)\. The authors gratefully acknowledge the resources on the LiCCA HPC cluster of the University of Augsburg, co\-funded by the Deutsche Forschungsgemeinschaft \(DFG, German Research Foundation\) – Project\-ID 499211671\. We would like to thank hpsmedia, especially Andreas Lauterbach, for their data contribution, and the authors of medBERT\.de, in particular Keno Bressem, for their assistance regarding certain areas of the corpus\. We are also grateful to Richard Zowalla for his helpful communication concerning the sGHW project and his openness in sharing insights\. Furthermore, we thank Karen Luna Samanez for providing an initial code base for web data deduplication, which supported the corpus preparation for this work\.
\\bmhead
Data availability The pretraining corpus consists of publicly available and licensed biomedical sources, including open\-access medical literature, de\-identified clinical notes, and curated web data\. Redistribution may be restricted for some datasets due to licensing constraints\. All resulting models are available on Huggingface\.
\\bmhead
Materials availability The pre\-trained ChristBERT models are publicly available at[https://huggingface\.co/ChristBERT](https://huggingface.co/ChristBERT)\. Fairseq checkpoints can be provided upon request\.
\\bmhead
Consent for publication Not applicable\.
\\bmhead
Conflict of interest The authors declare that they have no competing interests\.
\\bmhead
Author contribution Conceptualization, Raphael Scheible\-Schmitt; Data curation, Henry He, Raphael Scheible\-Schmitt and Johann Frei; Formal analysis, Henry He; Investigation, Henry He and Raphael Scheible\-Schmitt; Methodology, Henry He and Raphael Scheible\-Schmitt; Project administration, Raphael Scheible\-Schmitt; Resources, Johann Frei and Raphael Scheible\-Schmitt; Software, Henry He, Johann Frei and Raphael Scheible\-Schmitt; Supervision, Raphael Scheible\-Schmitt; Validation, Henry He; Visualization, Henry He and Raphael Scheible\-Schmitt; Writing – original draft, Henry He,Johann Frei and Raphael Scheible\-Schmitt; Writing – review and editing, Henry He, Johann Frei and Raphael Scheible\-Schmitt\. All authors read and approved the final manuscript\.
\\bmhead
Ethics approval and consent to participate Not applicable\.
\\bmhead
Funding Not applicable\.
## References
Supplementary Material
## Appendix APerplexity
Figure[S1](https://arxiv.org/html/2606.03250#A1.F1)illustrates the training instability observed during the diverged pre\-training of ChristBERTscratch\. The plot shows perplexity on the validation split of the pre\-training corpus across optimization steps\. A sharp increase in perplexity is visible around step 12,500, indicating a failure to converge\.
Figure S1:Perplexity during diverged pre\-training of ChristBERTscratch\. Perplexity is shown in log scale for every optimization step and evaluated on the validation split of the pre\-training corpus\. The plot illustrates a sharp increase in perplexity around the 12,500th step, indicating model instability and failure to converge\.
## Appendix BModel Properties
Table[S2](https://arxiv.org/html/2606.03250#A2.F2)summarizes the vocabulary size and number of parameters for each evaluated model\. While this table focuses on model size, other architectural differences are not shown\.
Figure S2:The vocabulary size and parameter size are shown for the evaluated models\. This table does not show other design differences of the models\. Values extracted usingHuggingface Transformerslibrary\.
## Appendix CTiming and Hyperparameter Search Overview
The total computation time required for pre\-training is detailed in Table[S1](https://arxiv.org/html/2606.03250#A3.T1)\. In addition, Table[S2](https://arxiv.org/html/2606.03250#A3.T2)reports the time spent on hyperparameter grid search for all downstream tasks, performed on a single NVIDIA RTX 3090 GPU\. Table[S3](https://arxiv.org/html/2606.03250#A3.T3)lists the fine\-tuning \(FT\) and inference \(PT\) runtimes for the final selected models, also measured on the same hardware\.
The best\-performing hyperparameter configurations \(batch size and learning rate\) for each task and model are provided in Table[S4](https://arxiv.org/html/2606.03250#A3.T4)\.
Table S1:Pre\-training computation time in days, hours and minutes summing up to 521 hours and 54 minutes, which are approximately 21\.74 days\.Table S2:Computation time in hours, minutes and seconds spent on the hyperparameter grid search for finding the best models for each task\. The grid search was performed on a single NVIDIA RTX 3090 GPU with 24 GB VRAM\. The total computation time for hyperparameter optimization sums up to 161 hours and 46 minutes, which are approximately 6\.74 days\.Table S3:Fine\-tuning \(FT\) runtime in minutes and seconds, and prediction runtime \(PT\) in seconds of the best downstream task models for each task\. Both were performed on one NVIDIA RTX 3090 GPU with 24 GB VRAM\.Table S4:Hyperparameters of the best downstream task models for each task and pre\-trained model\. BS and LR denote batch size and learning rate, respectively\.
## Appendix DDownstream Task Evaluation
Tables[S5](https://arxiv.org/html/2606.03250#A4.T5)through[S8](https://arxiv.org/html/2606.03250#A4.T8)present a detailed breakdown of evaluation results on the downstream tasks\. For each dataset, BRONCO150 \(Table[S5](https://arxiv.org/html/2606.03250#A4.T5)\), CARDIO:DE \(Table[S6](https://arxiv.org/html/2606.03250#A4.T6)\), GGPONC \(Table[S7](https://arxiv.org/html/2606.03250#A4.T7)\), and JSynCC \(Table[S8](https://arxiv.org/html/2606.03250#A4.T8)\), the precision, recall, and F1\-scores are reported for each class or entity\.
All results are shown as percentages and refer to the best fine\-tuned model selected based on validation set performance out of 28 grid search runs\. The best results are highlighted in bold and the second\-best are underlined\.
Table S5:Overview of per entity precision \(Prec\.\), recall \(Rec\.\) and F1scores achieved on the BRONCO150 dataset All results are shown in percent and assess each model’s best fine\-tuned performance on the test set\. The best model was selected out of 28 runs based on its validation set performance\. Best score in bold and second best underlined\.Table S6:Overview of per entity precision \(Prec\.\), recall \(Rec\.\) and F1scores achieved on the CARDIO:DE dataset All results are shown in percent and assess each model’s best fine\-tuned performance on the test set\. The best model was selected out of 28 runs based on its validation set performance\. Best score in bold and second best underlined\.Table S7:Overview of per entity precision \(Prec\.\), recall \(Rec\.\) and F1scores achieved on the GGPONC dataset All results are shown in percent and assess each model’s best fine\-tuned performance on the test set\. The best model was selected out of 28 runs based on its validation set performance\. Best score in bold and second best underlined\.Table S8:Overview of per class precision \(Prec\.\), recall \(Rec\.\) and F1scores achieved on the JSynCC dataset All results are shown in percent and assess each model’s best fine\-tuned performance on the test set\. The best model was selected out of 28 runs based on its validation set performance\. Best score in bold and second best underlined\.Similar Articles
Injecting Structured Biomedical Knowledge into Language Models: Continual Pretraining vs. GraphRAG
This paper compares two strategies for injecting structured biomedical knowledge from the UMLS Metathesaurus into language models: continual pretraining (embedding knowledge into model parameters) and GraphRAG (querying a knowledge graph at inference time). Results show improvements on biomedical QA benchmarks, with GraphRAG on LLaMA 3-8B yielding over 3 and 5 accuracy points on PubMedQA and BioASQ respectively without any retraining.
m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder
This paper introduces m3BERT, a multilingual bidirectional encoder with a novel pretraining strategy that jointly optimizes representations across transformer layers and multiple embedding dimensions, enabling a single model to be adapted to varied resource constraints. It significantly outperforms state-of-the-art models on the Bing-Click industrial retrieval dataset.
A Causal Language Modeling Detour Improves Encoder Continued Pretraining
This paper demonstrates that switching from Masked Language Modeling to Causal Language Modeling during encoder adaptation improves downstream performance on biomedical texts. The authors release ModernBERT-bio and ModernCamemBERT-bio as state-of-the-art biomedical encoders.
MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction
MedicalBench is a new benchmark for evaluating large language models on medical concept extraction from electronic health records, focusing on implicit reasoning and evidence grounding. It includes 823 expert-annotated examples and shows that current models perform modestly, highlighting the difficulty of extracting implicitly stated medical concepts.
A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models
This paper presents a multi-domain red teaming framework for evaluating safety, robustness, and fairness of medical LLMs across 690 clinically grounded scenarios. Results show that high aggregate accuracy can mask critical failures, and hybrid evaluation with clinician oversight is necessary for credible safety assessment.