

Should We Still Pretrain Encoders with Masked Language Modeling?

July 1, 2025
Authors: Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Manuel Faysse, Duarte M. Alves, Emmanuel Malherbe, André F. T. Martins, Céline Hudelot, Pierre Colombo
cs.AI

Abstract

Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 30 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models (from the existing LLM ecosystem), reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts at https://hf.co/MLMvsCLM to foster further research.
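To make the two pretraining objectives and the biphasic CLM-then-MLM schedule concrete, here is a minimal, illustrative sketch in PyTorch. It is not the authors' released code: the `model(input_ids, causal=...)` interface, the special-token and vocabulary ids, the 15% masking rate, and the 50/50 phase split are placeholder assumptions, not values from the paper.

```python
# Minimal sketch (not the paper's implementation) of the CLM and MLM objectives
# and a biphasic CLM -> MLM training schedule under a fixed step budget.
import torch
import torch.nn.functional as F

MASK_ID, VOCAB, IGNORE = 103, 30522, -100  # assumed [MASK] id, vocab size, ignore index

def clm_loss(logits, input_ids):
    # Causal LM: predict token t+1 from tokens <= t (shift logits vs. labels).
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, VOCAB), input_ids[:, 1:].reshape(-1)
    )

def mlm_loss(logits, labels):
    # Masked LM: predict only masked positions; unmasked labels are set to IGNORE.
    return F.cross_entropy(
        logits.reshape(-1, VOCAB), labels.reshape(-1), ignore_index=IGNORE
    )

def mask_tokens(input_ids, mask_prob=0.15):
    # Standard MLM corruption: replace ~15% of tokens with [MASK].
    labels = input_ids.clone()
    masked = torch.rand_like(input_ids, dtype=torch.float) < mask_prob
    labels[~masked] = IGNORE
    corrupted = input_ids.clone()
    corrupted[masked] = MASK_ID
    return corrupted, labels

def biphasic_step(model, input_ids, step, total_steps, clm_fraction=0.5):
    # Phase 1: CLM with causal attention; Phase 2: MLM with bidirectional attention.
    # The clm_fraction split is an assumed placeholder, not the paper's tuned value.
    if step < clm_fraction * total_steps:
        logits = model(input_ids, causal=True)
        return clm_loss(logits, input_ids)
    corrupted, labels = mask_tokens(input_ids)
    logits = model(corrupted, causal=False)
    return mlm_loss(logits, labels)
```

The same sketch also covers the strategy of initializing from an off-the-shelf CLM-pretrained model: in that case only the second (MLM) phase is run on top of the existing checkpoint.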