Should We Still Pretrain Encoders with Masked Language Modeling?

Abstract

Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 30 models ranging from 210 million to 1 billion parameters and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models (from the existing LLM ecosystem), reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts at https://hf.co/MLMvsCLM to foster further research.
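To make the two objectives and the biphasic schedule concrete, here is a minimal illustrative sketch in plain Python. It is not the authors' training code: the function names, the `[MASK]` token string, the 30% mask ratio, and the 25% CLM share of the budget are all assumptions chosen for illustration, and the masking is simplified (every selected position is replaced by the mask token).

```python
import random

MASK = "[MASK]"   # hypothetical mask token string
IGNORE = None     # marks positions that contribute no loss


def clm_example(tokens):
    """Causal LM: each position predicts the next token."""
    return tokens[:-1], tokens[1:]


def mlm_example(tokens, mask_ratio=0.3, seed=0):
    """Masked LM: hide a random subset and predict only those positions.

    mask_ratio is an arbitrary illustrative value, not taken from the paper.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(mask_ratio * len(tokens)))
    masked = set(rng.sample(range(len(tokens)), n_mask))
    inputs = [MASK if i in masked else t for i, t in enumerate(tokens)]
    targets = [t if i in masked else IGNORE for i, t in enumerate(tokens)]
    return inputs, targets


def two_phase_schedule(total_steps, clm_fraction):
    """Split a fixed step budget into a CLM phase followed by an MLM phase."""
    clm_steps = int(total_steps * clm_fraction)
    return [("clm", clm_steps), ("mlm", total_steps - clm_steps)]


if __name__ == "__main__":
    tokens = "the cat sat on the mat".split()
    print(clm_example(tokens))
    print(mlm_example(tokens))
    print(two_phase_schedule(total_steps=1000, clm_fraction=0.25))
```

The point of the sketch is the contrast in supervision: CLM receives a target at every position but sees only left context, whereas MLM sees the whole sequence but is supervised only at the masked positions. The schedule helper simply shows that the total budget stays fixed while the split between phases varies.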
