A Causal Language Modeling Detour Improves Encoder Continued Pretraining

Hugging Face Daily Papers

Summary

This paper demonstrates that switching from Masked Language Modeling to Causal Language Modeling during encoder adaptation improves downstream performance on biomedical texts. The authors release ModernBERT-bio and ModernCamemBERT-bio as state-of-the-art biomedical encoders.

When adapting an encoder to a new domain, the standard approach is to continue training with Masked Language Modeling (MLM). We show that temporarily switching to Causal Language Modeling (CLM) followed by a short MLM decay improves downstream performance. On biomedical texts with ModernBERT, this CLM detour outperforms MLM baselines trained on identical data and compute across 8 French and 11 English biomedical tasks, by +1.2-2.8pp and +0.3-0.8pp respectively, depending on model size. We investigate the reasons for these gains. We find that CLM's dense supervision impacts low transformer layers (0-7) far more than MLM does. Freezing low layers during CLM eliminates the downstream benefit; freezing mid layers preserves it. The representational changes persist through the MLM decay phase, even when it matches the CLM phase in length, and they scale with model capacity. We release ModernCamemBERT-bio and ModernBERT-bio as state-of-the-art biomedical encoders in Base and Large sizes.
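The "dense supervision" contrast and the two-phase schedule can be illustrated with a small sketch. It is not the authors' training code. It only builds toy label sequences to show that CLM provides a training target at almost every position, while MLM provides one only at the masked positions, and it leaves out the model and attention masking entirely. The 30% masking rate, the `decay_steps` parameter, and all function names are assumptions made for illustration.

```python
import random

IGNORE = -100  # label value conventionally skipped by cross-entropy losses


def mlm_labels(token_ids, mask_prob=0.3, seed=0):
    """MLM: only a random subset of positions carries a target.

    mask_prob=0.3 is an assumed value, not taken from the paper.
    """
    rng = random.Random(seed)
    return [tok if rng.random() < mask_prob else IGNORE for tok in token_ids]


def clm_labels(token_ids):
    """CLM: every position predicts the next token; the last has no target."""
    return list(token_ids[1:]) + [IGNORE]


def supervised_fraction(labels):
    """Share of positions that contribute to the loss."""
    return sum(label != IGNORE for label in labels) / len(labels)


def objective_for_step(step, total_steps, decay_steps):
    """Hypothetical schedule: CLM first, then MLM for the final decay_steps."""
    return "clm" if step < total_steps - decay_steps else "mlm"


tokens = list(range(100, 120))  # toy token ids
print("MLM supervised fraction:", supervised_fraction(mlm_labels(tokens)))
print("CLM supervised fraction:", supervised_fraction(clm_labels(tokens)))
```

In a real run the MLM baseline and the CLM detour would see the same data and compute, as the abstract states; only the objective chosen at each step would differ.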

Cached at: 05/13/26, 12:15 PM


Source: https://huggingface.co/papers/2605.12438

Abstract

Switching from Masked Language Modeling to Causal Language Modeling during encoder adaptation improves downstream performance on biomedical texts through dense supervision effects in lower transformer layers.

When adapting an encoder to a new domain, the standard approach is to continue training with Masked Language Modeling (MLM). We show that temporarily switching to Causal Language Modeling (CLM) followed by a short MLM decay improves downstream performance. On biomedical texts with ModernBERT, this CLM detour outperforms MLM baselines trained on identical data and compute across 8 French and 11 English biomedical tasks, by +1.2-2.8pp and +0.3-0.8pp respectively, depending on model size. We investigate the reasons for these gains. We find that CLM’s dense supervision impacts low transformer layers (0-7) far more than MLM does. Freezing low layers during CLM eliminates the downstream benefit; freezing mid layers preserves it. The representational changes persist through the MLM decay phase, even when it matches the CLM phase in length, and they scale with model capacity. We release ModernCamemBERT-bio and ModernBERT-bio as state-of-the-art biomedical encoders in Base and Large sizes.
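The layer-freezing ablation mentioned in the abstract could be set up by selecting parameters according to their layer index. The sketch below is a hypothetical helper, not the authors' code. It assumes parameter names contain a `layers.<index>.` segment, which should be checked against the actual model. The 0-7 range for "low" layers comes from the abstract; the helper names and the example parameter names are illustrative.

```python
import re

# Assumed naming pattern: "...layers.<index>.<rest>"
LAYER_RE = re.compile(r"(?:^|\.)layers\.(\d+)\.")


def layer_index(param_name):
    """Return the transformer layer index encoded in a parameter name, or None."""
    match = LAYER_RE.search(param_name)
    return int(match.group(1)) if match else None


def should_freeze(param_name, frozen_layers=range(0, 8)):
    """True if the parameter belongs to one of the layers to freeze (default 0-7)."""
    idx = layer_index(param_name)
    return idx is not None and idx in frozen_layers


def split_params(param_names, frozen_layers=range(0, 8)):
    """Partition parameter names into (frozen, trainable) lists."""
    frozen = [n for n in param_names if should_freeze(n, frozen_layers)]
    trainable = [n for n in param_names if not should_freeze(n, frozen_layers)]
    return frozen, trainable


# Illustrative parameter names, not read from a real checkpoint.
names = [
    "embeddings.tok_embeddings.weight",
    "layers.0.attn.Wqkv.weight",
    "layers.7.mlp.Wo.weight",
    "layers.8.attn.Wqkv.weight",
    "head.dense.weight",
]
frozen, trainable = split_params(names)
print("frozen:", frozen)
print("trainable:", trainable)
```

With a PyTorch model, the same predicate could be applied while iterating over `model.named_parameters()` and setting `requires_grad = False` on the matching parameters during the CLM phase.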


Community

Hi @rntc, very cool idea! By the way, do you plan to release the code? I would like to try this with other models for domain adaptation 😃


Paper submitter

about 4 hours ago

Release of ModernBERT-bio and ModernCamemBERT-bio


Get this paper in your agent:

hf papers read 2605.12438

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 4

#### almanach/ModernBERT-bio-large

Fill-Mask • 0.4B • Updated about 4 hours ago • 34 • 2

#### almanach/ModernCamemBERT-bio-base

Fill-Mask • Updated about 4 hours ago • 1

#### almanach/ModernCamemBERT-bio-large

Fill-Mask • 0.4B • Updated about 4 hours ago • 226 • 1

#### almanach/ModernBERT-bio-base

Fill-Mask • 0.1B • Updated about 4 hours ago • 32 • 1


Similar Articles

m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder

arXiv cs.CL

This paper introduces m3BERT, a multilingual bidirectional encoder with a novel pretraining strategy that jointly optimizes representations across transformer layers and multiple embedding dimensions, enabling a single model to be adapted to varied resource constraints. It significantly outperforms state-of-the-art models on the Bing-Click industrial retrieval dataset.

Attribution-Guided Continual Learning for Large Language Models

arXiv cs.LG

This paper proposes an attribution-guided continual fine-tuning framework for large language models that estimates task-specific parameter importance in Transformer layers and modulates gradients accordingly, mitigating catastrophic forgetting while maintaining performance on new tasks.