

Should We Still Pretrain Encoders with Masked Language Modeling?

July 1, 2025
作者: Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Manuel Faysse, Duarte M. Alves, Emmanuel Malherbe, André F. T. Martins, Céline Hudelot, Pierre Colombo
cs.AI

Abstract

Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 30 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models (from the existing LLM ecosystem), reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts at https://hf.co/MLMvsCLM to foster further research.
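To make the biphasic recipe concrete, the following is a minimal, illustrative PyTorch sketch (not the paper's implementation) of a single model trained first with a causal language modeling loss and then with a masked language modeling loss under a fixed step budget. The TinyEncoder architecture, the 50/50 compute split, the 15% masking rate, and all token ids are assumptions chosen for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, D_MODEL, N_HEADS = 32000, 256, 4
MASK_ID, PAD_ID = 4, 0
MLM_PROB, CLM_FRACTION = 0.15, 0.5   # assumed masking rate and compute split

class TinyEncoder(nn.Module):
    # Toy Transformer that can run with either causal or bidirectional attention.
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, N_HEADS, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, input_ids, causal):
        T = input_ids.size(1)
        attn_mask = nn.Transformer.generate_square_subsequent_mask(T) if causal else None
        h = self.encoder(self.emb(input_ids), mask=attn_mask)
        return self.lm_head(h)   # (B, T, VOCAB)

def clm_loss(model, input_ids):
    # Causal LM: predict token t+1 from tokens up to t.
    logits = model(input_ids, causal=True)
    return F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB),
                           input_ids[:, 1:].reshape(-1), ignore_index=PAD_ID)

def mlm_loss(model, input_ids):
    # Masked LM: corrupt ~15% of tokens, score predictions only at masked positions.
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < MLM_PROB
    labels[~masked] = -100                        # ignored by cross_entropy
    corrupted = input_ids.masked_fill(masked, MASK_ID)
    logits = model(corrupted, causal=False)
    return F.cross_entropy(logits.reshape(-1, VOCAB), labels.reshape(-1))

def biphasic_pretrain(model, optimizer, batches, total_steps):
    # Spend the first CLM_FRACTION of the fixed step budget on CLM, the rest on MLM.
    switch_at = int(CLM_FRACTION * total_steps)
    for step, input_ids in zip(range(total_steps), batches):
        loss = clm_loss(model, input_ids) if step < switch_at else mlm_loss(model, input_ids)
        optimizer.zero_grad(); loss.backward(); optimizer.step()

# Usage with random data (illustrative only):
model = TinyEncoder()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
data = (torch.randint(5, VOCAB, (8, 64)) for _ in range(20))
biphasic_pretrain(model, opt, data, total_steps=20)

The single switch point above is only one way to realize the CLM-then-MLM schedule; the paper studies how the split of a fixed training budget between the two objectives affects downstream representation quality.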