Should We Still Pretrain Encoders with Masked Language Modeling?

Abstract

Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 30 models ranging from 210 million to 1 billion parameters and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models from the existing LLM ecosystem, reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts at https://hf.co/MLMvsCLM to foster further research.
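To make the two objectives and the biphasic schedule concrete, the following is a minimal, dependency-free sketch and not the implementation used in this work. The function names, the `[MASK]` token string, the 50/50 step split, and the 0.4 masking ratio are all illustrative assumptions; the abstract does not specify any of them.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, chosen for illustration
IGNORE = None    # marks positions that do not contribute to the loss


def objective_for_step(step, total_steps, clm_fraction):
    """Biphasic schedule: CLM for the first clm_fraction of steps, then MLM.

    clm_fraction is an assumed knob; the abstract does not state the split.
    """
    return "clm" if step < int(total_steps * clm_fraction) else "mlm"


def clm_example(tokens):
    """Causal LM: every position predicts the next token.

    The attention pattern is lower-triangular, so position i only sees
    positions 0..i.
    """
    inputs = tokens[:-1]
    labels = tokens[1:]
    n = len(inputs)
    attention = [[j <= i for j in range(n)] for i in range(n)]
    return inputs, labels, attention


def mlm_example(tokens, mask_ratio, rng):
    """Masked LM: hide a random subset of tokens and predict only those.

    The attention pattern is full (bidirectional).
    """
    n = len(tokens)
    k = max(1, round(n * mask_ratio))
    masked = set(rng.sample(range(n), k))
    inputs = [MASK if i in masked else t for i, t in enumerate(tokens)]
    labels = [t if i in masked else IGNORE for i, t in enumerate(tokens)]
    attention = [[True] * n for _ in range(n)]
    return inputs, labels, attention


tokens = "the cat sat on the mat".split()

# Assumed split: half of a 10-step budget on CLM, the rest on MLM.
schedule = [objective_for_step(s, 10, 0.5) for s in range(10)]

clm_inputs, clm_labels, clm_attention = clm_example(tokens)
mlm_inputs, mlm_labels, mlm_attention = mlm_example(tokens, 0.4, random.Random(0))
```

The sketch highlights the two differences between the phases: which positions carry a prediction target (all next tokens for CLM, only the hidden tokens for MLM) and which positions each token may attend to (past only for CLM, the full sequence for MLM).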
