A Causal Language Modeling Detour Improves Encoder Continued Pretraining

Hugging Face Daily Papers 05/12/26, 12:00 AM Papers

Summary

This paper demonstrates that switching from Masked Language Modeling to Causal Language Modeling during encoder adaptation improves downstream performance on biomedical texts. The authors release ModernBERT-bio and ModernCamemBERT-bio as state-of-the-art biomedical encoders.

When adapting an encoder to a new domain, the standard approach is to continue training with Masked Language Modeling (MLM). We show that temporarily switching to Causal Language Modeling (CLM) followed by a short MLM decay improves downstream performance. On biomedical texts with ModernBERT, this CLM detour outperforms MLM baselines trained on identical data and compute across 8 French and 11 English biomedical tasks, by +1.2-2.8pp and +0.3-0.8pp respectively, depending on model size. We investigate the reasons for these gains. We find that CLM's dense supervision impacts low transformer layers (0-7) far more than MLM does. Freezing low layers during CLM eliminates the downstream benefit; freezing mid layers preserves it. The representational changes persist through the MLM decay phase, even when it matches the CLM phase in length, and they scale with model capacity. We release ModernCamemBERT-bio and ModernBERT-bio as state-of-the-art biomedical encoders in Base and Large sizes.
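The "CLM detour" and the notion of dense supervision can be illustrated with a small, dependency-free sketch. It is not the authors' training code: the helper names, the `clm_fraction` value, and the `mask_rate` value are all assumptions made for illustration, since the abstract only states that a CLM phase is followed by a short MLM decay. The sketch shows the two-phase schedule and counts how many positions receive a loss under each objective.

```python
import random


def objective_schedule(total_steps, clm_fraction=0.8):
    """Return the objective used at each step: a CLM phase, then an MLM decay.

    clm_fraction is a hypothetical value; the abstract does not give the split.
    """
    clm_steps = int(total_steps * clm_fraction)
    return ["clm"] * clm_steps + ["mlm"] * (total_steps - clm_steps)


def clm_labels(token_ids):
    """Causal LM: every position except the last predicts the next token."""
    return [(i, token_ids[i + 1]) for i in range(len(token_ids) - 1)]


def mlm_labels(token_ids, mask_rate=0.3, seed=0):
    """Masked LM: only a sampled subset of positions receives a loss.

    mask_rate=0.3 is an assumption, not a value stated in the abstract.
    """
    rng = random.Random(seed)
    n_masked = max(1, round(len(token_ids) * mask_rate))
    positions = sorted(rng.sample(range(len(token_ids)), n_masked))
    return [(p, token_ids[p]) for p in positions]


tokens = list(range(100, 120))  # 20 placeholder token ids
schedule = objective_schedule(total_steps=10)
print(schedule)
print(len(clm_labels(tokens)), "supervised positions under CLM")
print(len(mlm_labels(tokens)), "supervised positions under MLM")
```

Under these assumed settings, CLM supervises 19 of the 20 positions while MLM supervises 6, which is the sense in which CLM supervision is "dense".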

Original Article

Cached at: 05/13/26, 12:15 PM

Source: https://huggingface.co/papers/2605.12438

Abstract

Switching from Masked Language Modeling to Causal Language Modeling during encoder adaptation improves downstream performance on biomedical texts through dense supervision effects in lower transformer layers.

When adapting an encoder to a new domain, the standard approach is to continue training with Masked Language Modeling (MLM). We show that temporarily switching to Causal Language Modeling (CLM) followed by a short MLM decay improves downstream performance. On biomedical texts with ModernBERT, this CLM detour outperforms MLM baselines trained on identical data and compute across 8 French and 11 English biomedical tasks, by +1.2-2.8pp and +0.3-0.8pp respectively, depending on model size. We investigate the reasons for these gains. We find that CLM’s dense supervision impacts low transformer layers (0-7) far more than MLM does. Freezing low layers during CLM eliminates the downstream benefit; freezing mid layers preserves it. The representational changes persist through the MLM decay phase, even when it matches the CLM phase in length, and they scale with model capacity. We release ModernCamemBERT-bio and ModernBERT-bio as state-of-the-art biomedical encoders in Base and Large sizes.
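The layer-freezing ablation described in the abstract can be sketched as a rule that decides, from a parameter's name, whether it stays trainable. This is a minimal sketch under stated assumptions, not the authors' procedure: the `layers.<index>.` naming pattern and the example parameter names are hypothetical, and leaving non-layer parameters (such as embeddings) trainable is a choice made here for illustration. Only the low-layer range 0-7 comes from the abstract.

```python
import re

# Assumed naming pattern: transformer blocks appear as "layers.<index>." in parameter names.
LAYER_RE = re.compile(r"(?:^|\.)layers\.(\d+)\.")

# Layers 0-7 are the "low" layers named in the abstract.
LOW_LAYERS = range(0, 8)


def is_trainable(param_name, frozen_layers):
    """Return False if the parameter belongs to a frozen layer, True otherwise."""
    match = LAYER_RE.search(param_name)
    if match is None:
        # Not inside a numbered layer (e.g. embeddings): left trainable in this sketch.
        return True
    return int(match.group(1)) not in frozen_layers


# Hypothetical parameter names, for illustration only.
names = [
    "embeddings.tok_embeddings.weight",
    "layers.3.attn.Wqkv.weight",
    "layers.12.mlp.Wo.weight",
]
for name in names:
    print(name, "->", "trainable" if is_trainable(name, LOW_LAYERS) else "frozen")
```

With a PyTorch model, the same rule could be applied by looping over `model.named_parameters()` and setting `param.requires_grad = is_trainable(name, LOW_LAYERS)` before the CLM phase.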

Community

Hi @rntc, very cool idea! By the way, do you plan to release the code? I would like to try this with other models for domain adaptation 😃

Paper submitter

about 4 hours ago

Release of ModernBERT-bio and ModernCamemBERT-bio

Get this paper in your agent:

hf papers read 2605.12438

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 4

#### almanach/ModernBERT-bio-large
Fill-Mask • 0.4B • Updated about 4 hours ago • 34 • 2

#### almanach/ModernCamemBERT-bio-base
Fill-Mask • Updated about 4 hours ago • 1

#### almanach/ModernCamemBERT-bio-large
Fill-Mask • 0.4B • Updated about 4 hours ago • 226 • 1

#### almanach/ModernBERT-bio-base
Fill-Mask • 0.1B • Updated about 4 hours ago • 32 • 1

Similar Articles

The Critical Role of Model Selection in Causal Inference: A Comparative Analysis of Classification Models within the InferBERT Framework for Pharmacovigilance

m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder

Attribution-Guided Continual Learning for Large Language Models

A Computational Audit of Demographic Association Encoding in ClinicalBERT Language Predictions

The Word and the Way: Strategies for Domain-Specific BERT Pre-Training in German Medical NLP
