Should We Still Pretrain Encoders with Masked Language Modeling?

Abstract

Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 30 models ranging from 210 million to 1 billion parameters and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models from the existing LLM ecosystem, reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts at https://hf.co/MLMvsCLM to foster further research.
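
To make the two objectives and the biphasic schedule concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The function names, the `clm_fraction` split, and the `mask_prob` masking rate are hypothetical illustrations; the abstract does not state the actual values or masking scheme used in the experiments.

```python
import random

def objective_for_step(step, total_steps, clm_fraction):
    """Pick the training objective for a step in a two-phase schedule.

    Assumption: the first `clm_fraction` of the step budget uses CLM and the
    remainder uses MLM. The split point is a hypothetical parameter.
    """
    clm_steps = int(total_steps * clm_fraction)
    return "clm" if step < clm_steps else "mlm"

def clm_example(tokens):
    """Causal LM: predict each token from the tokens before it."""
    return tokens[:-1], tokens[1:]

def mlm_example(tokens, mask_token, mask_prob, rng):
    """Masked LM: hide a random subset of tokens and predict only those.

    Simplified on purpose: every selected token is replaced by `mask_token`.
    Positions with target None are not scored.
    """
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            targets.append(tok)
        else:
            inputs.append(tok)
            targets.append(None)
    return inputs, targets

if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [11, 12, 13, 14, 15, 16]
    for step in range(4):
        # Hypothetical budget of 4 steps, half spent on each objective.
        if objective_for_step(step, total_steps=4, clm_fraction=0.5) == "clm":
            print(step, "clm", clm_example(tokens))
        else:
            print(step, "mlm", mlm_example(tokens, 0, 0.3, rng))
```

Under this sketch, the total number of steps stays fixed while `clm_fraction` varies, which mirrors the idea of comparing schedules under a fixed training budget. Initializing from an existing pretrained CLM model would correspond to skipping the first phase and running only the MLM steps.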

