A Causal Language Modeling Detour Improves Encoder Continued Pretraining
Summary
This paper demonstrates that switching from Masked Language Modeling to Causal Language Modeling during encoder adaptation improves downstream performance on biomedical texts. The authors release ModernBERT-bio and ModernCamemBERT-bio as state-of-the-art biomedical encoders.
Cached at: 05/13/26, 12:15 PM
Source: https://huggingface.co/papers/2605.12438
Abstract
Switching from Masked Language Modeling to Causal Language Modeling during encoder adaptation improves downstream performance on biomedical texts through dense supervision effects in lower transformer layers.
When adapting an encoder to a new domain, the standard approach is to continue training with Masked Language Modeling (MLM). We show that temporarily switching to Causal Language Modeling (CLM) followed by a short MLM decay improves downstream performance. On biomedical texts with ModernBERT, this CLM detour outperforms MLM baselines trained on identical data and compute across 8 French and 11 English biomedical tasks, by +1.2-2.8pp and +0.3-0.8pp respectively, depending on model size. We investigate the reasons for these gains. We find that CLM’s dense supervision impacts low transformer layers (0-7) far more than MLM does. Freezing low layers during CLM eliminates the downstream benefit; freezing mid layers preserves it. The representational changes persist through the MLM decay phase, even when it matches the CLM phase in length, and they scale with model capacity. We release ModernCamemBERT-bio and ModernBERT-bio as state-of-the-art biomedical encoders in Base and Large sizes.
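To make the "dense supervision" contrast concrete, here is a minimal pure-Python sketch, not the authors' code. It shows that a causal objective yields a training target at nearly every position, while a masked objective yields one only at masked positions, and it sketches a two-phase schedule (CLM first, then a short MLM phase). The function names, the 0.3 masking probability, and the step counts are illustrative assumptions; the abstract does not specify these values, and real MLM pipelines typically use a more elaborate replacement scheme than plain masking.

```python
import random

MASK = "[MASK]"  # placeholder mask token (assumption, for illustration only)


def mlm_targets(tokens, mask_prob=0.3, seed=0):
    """Masked LM: only masked positions carry a label (None = no loss)."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)      # the model must recover the original token
        else:
            inputs.append(tok)
            labels.append(None)     # no supervision at this position
    return inputs, labels


def clm_targets(tokens):
    """Causal LM: position i predicts token i + 1, so every input has a label."""
    return tokens[:-1], tokens[1:]


def supervised_count(labels):
    """Number of positions that contribute to the loss."""
    return sum(label is not None for label in labels)


def phase_for_step(step, total_steps, mlm_decay_steps):
    """Hypothetical schedule: CLM detour first, then a final MLM phase."""
    return "clm" if step < total_steps - mlm_decay_steps else "mlm"


if __name__ == "__main__":
    tokens = ["the", "patient", "was", "given", "aspirin", "daily"]
    _, mlm_labels = mlm_targets(tokens)
    _, clm_labels = clm_targets(tokens)
    print("MLM supervised positions:", supervised_count(mlm_labels))
    print("CLM supervised positions:", supervised_count(clm_labels))
    print("phase at step 950 of 1000:", phase_for_step(950, 1000, 100))
```

In an actual training run the CLM phase would also use a causal attention mask, whereas the MLM phase uses bidirectional attention; that detail is omitted here to keep the sketch dependency-free.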
Community
Hi @rntc, very cool idea! By the way, do you plan to release the code? I would like to try this with other models for domain adaptation 😃
Paper submitter
Release of ModernBERT-bio and ModernCamemBERT-bio
Get this paper in your agent:
hf papers read 2605.12438
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 4
almanach/ModernBERT-bio-large (Fill-Mask, 0.4B)
almanach/ModernCamemBERT-bio-base (Fill-Mask)
almanach/ModernCamemBERT-bio-large (Fill-Mask, 0.4B)
almanach/ModernBERT-bio-base (Fill-Mask, 0.1B)
Similar Articles
The Critical Role of Model Selection in Causal Inference: A Comparative Analysis of Classification Models within the InferBERT Framework for Pharmacovigilance
This paper systematically evaluates the impact of classification model selection within the InferBERT framework for causal adverse drug event detection, finding that domain-specific pre-training (BioBERT) outperforms both simpler models and larger LLMs like Med-LLaMA.
m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder
This paper introduces m3BERT, a multilingual bidirectional encoder with a novel pretraining strategy that jointly optimizes representations across transformer layers and multiple embedding dimensions, enabling a single model to be adapted to varied resource constraints. It significantly outperforms state-of-the-art models on the Bing-Click industrial retrieval dataset.
Attribution-Guided Continual Learning for Large Language Models
This paper proposes an attribution-guided continual fine-tuning framework for large language models that estimates task-specific parameter importance in Transformer layers and modulates gradients accordingly, mitigating catastrophic forgetting while maintaining performance on new tasks.
A Computational Audit of Demographic Association Encoding in ClinicalBERT Language Predictions
This paper presents a computational audit of representational bias in ClinicalBERT, finding that demographic associations are amplified by the model itself rather than inherited from training data.
The Word and the Way: Strategies for Domain-Specific BERT Pre-Training in German Medical NLP
This paper introduces ChristBERT, a family of domain-specific RoBERTa-based language models for German clinical NLP, and evaluates three domain adaptation strategies (continued pre-training, pre-training from scratch, and vocabulary adaptation) on medical named entity recognition and text classification tasks, achieving state-of-the-art results.