A Causal Language Modeling Detour Improves Encoder Continued Pretraining
May 12, 2026
Authors: Rian Touchent, Eric de la Clergerie
cs.AI
Abstract
When adapting an encoder to a new domain, the standard practice is to continue training with Masked Language Modeling (MLM). We show that temporarily switching to Causal Language Modeling (CLM), followed by a short MLM decay phase, improves downstream performance. On biomedical text with ModernBERT, this CLM detour outperforms MLM baselines trained on identical data and compute across 8 French and 11 English biomedical tasks, by +1.2 to +2.8 pp and +0.3 to +0.8 pp respectively, depending on model size. We investigate the source of these gains and find that CLM's dense supervision affects the lower Transformer layers (0-7) far more than MLM does. Freezing the lower layers during CLM eliminates the downstream benefit; freezing the middle layers preserves it. The representational changes persist through the MLM decay phase, even when that phase matches the CLM phase in length, and they scale with model capacity. We release ModernCamemBERT-bio and ModernBERT-bio as state-of-the-art biomedical encoders at the Base and Large sizes.