

Seq vs Seq: An Open Suite of Paired Encoders and Decoders

July 15, 2025
作者: Orion Weller, Kathryn Ricci, Marc Marone, Antoine Chaffin, Dawn Lawrie, Benjamin Van Durme
cs.AI

Abstract

The large language model (LLM) community focuses almost exclusively on decoder-only language models, since they are easier to use for text generation. However, a large subset of the community still uses encoder-only models for tasks such as classification or retrieval. Previous work has attempted to compare these architectures, but has been forced to compare models with different numbers of parameters, training techniques, and datasets. We introduce the SOTA open-data Ettin suite of models: paired encoder-only and decoder-only models ranging from 17 million to 1 billion parameters, trained on up to 2 trillion tokens. Using the same recipe for both encoder-only and decoder-only models produces SOTA recipes in both categories for their respective sizes, beating ModernBERT as an encoder and Llama 3.2 and SmolLM2 as decoders. Consistent with previous work, we find that encoder-only models excel at classification and retrieval tasks while decoders excel at generative tasks. However, we show that adapting a decoder model to encoder tasks (and vice versa) through continued training is subpar compared to using only the reverse objective (i.e., a 400M encoder outperforms a 1B decoder on MNLI, and vice versa for generative tasks). We open-source all artifacts of this study, including training data, training order segmented by checkpoint, and 200+ checkpoints, to allow future work to analyze or extend all aspects of training.
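
To make the contrast between the two paired objectives concrete, below is a minimal sketch of how an encoder checkpoint (masked language modeling, bidirectional attention) and a decoder checkpoint (causal language modeling, left-to-right attention) could be loaded and used with Hugging Face Transformers. The model identifiers `jhu-clsp/ettin-encoder-400m` and `jhu-clsp/ettin-decoder-400m` are assumptions used for illustration; verify the exact names against the released artifacts.

```python
# Minimal sketch, assuming hypothetical Ettin checkpoint names on the Hugging Face Hub.
# The encoder is paired with a masked-LM head; the decoder with a causal-LM head.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModelForCausalLM

ENCODER_ID = "jhu-clsp/ettin-encoder-400m"  # assumed identifier
DECODER_ID = "jhu-clsp/ettin-decoder-400m"  # assumed identifier

# Encoder-only model: fill in a masked token using bidirectional context.
enc_tok = AutoTokenizer.from_pretrained(ENCODER_ID)
encoder = AutoModelForMaskedLM.from_pretrained(ENCODER_ID)
enc_inputs = enc_tok(
    f"Encoder-only models excel at {enc_tok.mask_token} and retrieval tasks.",
    return_tensors="pt",
)
with torch.no_grad():
    enc_logits = encoder(**enc_inputs).logits
mask_pos = (enc_inputs.input_ids == enc_tok.mask_token_id).nonzero(as_tuple=True)[1]
print("Encoder fill-in:", enc_tok.decode(enc_logits[0, mask_pos].argmax(-1)))

# Decoder-only model: continue a prompt token by token with causal attention.
dec_tok = AutoTokenizer.from_pretrained(DECODER_ID)
decoder = AutoModelForCausalLM.from_pretrained(DECODER_ID)
dec_inputs = dec_tok("Decoder-only language models are well suited to", return_tensors="pt")
with torch.no_grad():
    out = decoder.generate(**dec_inputs, max_new_tokens=20)
print("Decoder continuation:", dec_tok.decode(out[0], skip_special_tokens=True))
```

Because both models in a pair share the same data, training order, and recipe, differences in downstream behavior (e.g., MNLI classification versus open-ended generation) can be attributed to the training objective and attention pattern rather than to confounds in scale or data.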