A Causal Language Modeling Detour Improves Encoder Continued Pretraining
May 12, 2026
Authors: Rian Touchent, Eric de la Clergerie
cs.AI
Abstract
When adapting an encoder to a new domain, the standard approach is to continue training with Masked Language Modeling (MLM). We show that temporarily switching to Causal Language Modeling (CLM) followed by a short MLM decay improves downstream performance. On biomedical texts with ModernBERT, this CLM detour outperforms MLM baselines trained on identical data and compute across 8 French and 11 English biomedical tasks, by +1.2-2.8pp and +0.3-0.8pp respectively, depending on model size. We investigate the reasons for these gains. We find that CLM's dense supervision impacts low transformer layers (0-7) far more than MLM does. Freezing low layers during CLM eliminates the downstream benefit; freezing mid layers preserves it. The representational changes persist through the MLM decay phase, even when it matches the CLM phase in length, and they scale with model capacity. We release ModernCamemBERT-bio and ModernBERT-bio as state-of-the-art biomedical encoders in Base and Large sizes.
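The abstract describes a two-phase recipe: continue pretraining the encoder with a causal objective (left-to-right attention, next-token labels), then finish with a short MLM decay phase before fine-tuning. The sketch below is purely illustrative and is not the authors' released code: the tiny model, layer count, mask token id, and 30% masking probability are placeholder assumptions, and the ModernBERT-specific setup is omitted.

```python
# Minimal PyTorch sketch of a "CLM detour" for encoder continued pretraining:
# phase 1 trains with a causal attention mask and shifted next-token labels,
# phase 2 is a short MLM decay with the usual masked-token objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, D_MODEL, MASK_ID = 30522, 256, 103  # illustrative values only

class TinyEncoderLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.lm_head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, ids, causal=False):
        mask = None
        if causal:
            # Additive causal mask: -inf above the diagonal blocks right context.
            L = ids.size(1)
            mask = torch.full((L, L), float("-inf"), device=ids.device).triu(1)
        h = self.encoder(self.embed(ids), mask=mask)
        return self.lm_head(h)

def clm_step(model, ids):
    """CLM phase: every position t predicts token t+1 (dense supervision)."""
    logits = model(ids, causal=True)
    return F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB),
                           ids[:, 1:].reshape(-1))

def mlm_step(model, ids, p=0.3):
    """MLM decay phase: predict only the randomly masked-out positions."""
    labels = ids.clone()
    masked = torch.rand(ids.shape, device=ids.device) < p
    labels[~masked] = -100                     # loss ignored on unmasked tokens
    corrupted = ids.masked_fill(masked, MASK_ID)
    logits = model(corrupted, causal=False)    # full bidirectional attention
    return F.cross_entropy(logits.reshape(-1, VOCAB), labels.reshape(-1),
                           ignore_index=-100)
```

A full run in this style would spend most of the domain-adaptation budget on clm_step, then switch to mlm_step for a short decay, keeping total data and compute matched to a pure-MLM baseline as in the paper's comparison. The layer-freezing ablation reported above could be approximated by setting requires_grad_(False) on the lower encoder layers during the CLM phase.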