Seq vs Seq: An Open Suite of Paired Encoders and Decoders

Abstract

The large language model (LLM) community focuses almost exclusively on decoder-only language models, since they are easier to use for text generation. However, a large subset of the community still uses encoder-only models for tasks such as classification or retrieval. Previous work has attempted to compare these architectures, but has been forced to compare models that differ in parameter count, training techniques, and datasets. We introduce the SOTA open-data Ettin suite of models: paired encoder-only and decoder-only models ranging from 17 million to 1 billion parameters, trained on up to 2 trillion tokens. Using the same recipe for both encoder-only and decoder-only models produces SOTA recipes in both categories for their respective sizes, beating ModernBERT as an encoder and Llama 3.2 and SmolLM2 as decoders. Like previous work, we find that encoder-only models excel at classification and retrieval tasks, while decoders excel at generative tasks. However, we show that adapting a decoder model to encoder tasks (and vice versa) through continued training is subpar compared to using only the reverse objective (e.g., a 400M encoder outperforms a 1B decoder on MNLI, and vice versa for generative tasks). We open-source all artifacts of this study, including the training data, the training order segmented by checkpoint, and 200+ checkpoints, to allow future work to analyze or extend all aspects of training.

