A Causal Language Modeling Detour Improves Encoder Continued Pretraining

Abstract

When adapting an encoder to a new domain, the standard approach is to continue training with Masked Language Modeling (MLM). We show that temporarily switching to Causal Language Modeling (CLM) followed by a short MLM decay improves downstream performance. On biomedical texts with ModernBERT, this CLM detour outperforms MLM baselines trained on identical data and compute across 8 French and 11 English biomedical tasks, by +1.2-2.8pp and +0.3-0.8pp respectively, depending on model size. We investigate the reasons for these gains. We find that CLM's dense supervision impacts low transformer layers (0-7) far more than MLM does. Freezing low layers during CLM eliminates the downstream benefit; freezing mid layers preserves it. The representational changes persist through the MLM decay phase, even when it matches the CLM phase in length, and they scale with model capacity. We release ModernCamemBERT-bio and ModernBERT-bio as state-of-the-art biomedical encoders in Base and Large sizes.
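
The two-phase schedule and the notion of "dense supervision" can be illustrated with a minimal, dependency-free sketch. It is not the authors' training code: the function names, the `decay_fraction` and `mask_prob` defaults, and the `-100` ignore label are assumptions chosen for illustration, and the attention-mask change that CLM implies is not modeled here. The sketch only shows which objective is active at a given step and how many positions per sequence receive a training signal under each objective.

```python
import random

# Label value meaning "no loss at this position" (a common convention, assumed here).
IGNORE = -100


def objective_for_step(step, total_steps, decay_fraction=0.1):
    """Return "clm" during the detour and "mlm" during the final decay phase.

    decay_fraction is a hypothetical knob; the abstract only says the decay is short.
    """
    decay_start = total_steps - round(total_steps * decay_fraction)
    return "clm" if step < decay_start else "mlm"


def clm_labels(token_ids):
    """Causal LM: position i is supervised to predict token i + 1."""
    return token_ids[1:] + [IGNORE]


def mlm_labels(token_ids, mask_prob=0.3, rng=None):
    """Masked LM: only a random subset of positions carries a loss.

    mask_prob is an illustrative value, not one taken from the abstract.
    """
    rng = rng or random.Random(0)
    return [tok if rng.random() < mask_prob else IGNORE for tok in token_ids]


def supervised_fraction(labels):
    """Fraction of positions that contribute to the loss."""
    return sum(label != IGNORE for label in labels) / len(labels)


tokens = list(range(100, 120))  # 20 dummy token ids
print(supervised_fraction(clm_labels(tokens)))  # 0.95
print(supervised_fraction(mlm_labels(tokens)))  # roughly mask_prob, varies with the seed
```

Under CLM every position except the last is supervised, whereas under MLM only the masked subset is, which is the sense in which the abstract calls CLM supervision dense. The layer-freezing ablations mentioned in the abstract are not covered by this sketch.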
