Seq vs Seq: An Open Suite of Paired Encoders and Decoders
July 15, 2025
Authors: Orion Weller, Kathryn Ricci, Marc Marone, Antoine Chaffin, Dawn Lawrie, Benjamin Van Durme
cs.AI
Abstract
The large language model (LLM) community focuses almost exclusively on decoder-only language models, since they are easier to use for text generation. However, a large subset of the community still uses encoder-only models for tasks such as classification or retrieval. Previous work has attempted to compare these architectures, but has had to compare models that differ in parameter count, training techniques, and datasets. We introduce the SOTA open-data Ettin suite of models: paired encoder-only and decoder-only models ranging from 17 million parameters to 1 billion, trained on up to 2 trillion tokens. Using the same recipe for both encoder-only and decoder-only models produces SOTA models in both categories for their respective sizes, beating ModernBERT as an encoder and Llama 3.2 and SmolLM2 as decoders. Like previous work, we find that encoder-only models excel at classification and retrieval tasks while decoders excel at generative tasks. However, we show that adapting a decoder model to encoder tasks (and vice versa) through continued training is subpar compared to simply using a model trained with the opposite objective (e.g., a 400M encoder outperforms a 1B decoder on MNLI, and vice versa for generative tasks). We open-source all artifacts of this study, including training data, training order segmented by checkpoint, and 200+ checkpoints, to allow future work to analyze or extend all aspects of training.
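For readers unfamiliar with the two pretraining setups being paired here, the sketch below contrasts the objectives conventionally used for each architecture; the exact loss formulation used for Ettin is an assumption on our part, not a quote from the paper: masked language modeling for the encoder and next-token prediction for the decoder.

```latex
% Sketch of the two standard pretraining objectives (assumed, not quoted from the paper):
% masked language modeling for the encoder, next-token prediction for the decoder.
\begin{align*}
\mathcal{L}_{\text{MLM}}(\theta) &= -\sum_{i \in \mathcal{M}} \log p_\theta\!\left(x_i \mid x_{\setminus \mathcal{M}}\right)
  && \text{(encoder: predict masked positions } \mathcal{M} \text{ from bidirectional context)} \\
\mathcal{L}_{\text{CLM}}(\theta) &= -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right)
  && \text{(decoder: predict each token from its left context only)}
\end{align*}
```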
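As a usage illustration only, here is a minimal sketch of how paired checkpoints from an open suite like this could be loaded with the Hugging Face `transformers` library. The repository names, the 400M size tag, and the `[MASK]` token are placeholder assumptions, not confirmed identifiers from the release.

```python
# Minimal sketch: loading a paired encoder and decoder of the same size.
# Assumes the standard Hugging Face `transformers` API; the checkpoint names
# below are hypothetical placeholders, not confirmed release identifiers.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    AutoModelForCausalLM,
)

ENCODER_ID = "org/ettin-encoder-400m"  # hypothetical repo name
DECODER_ID = "org/ettin-decoder-400m"  # hypothetical repo name

# Encoder: bidirectional model, suited to classification / retrieval heads.
enc_tokenizer = AutoTokenizer.from_pretrained(ENCODER_ID)
encoder = AutoModelForMaskedLM.from_pretrained(ENCODER_ID)

# Decoder: causal model, suited to text generation.
dec_tokenizer = AutoTokenizer.from_pretrained(DECODER_ID)
decoder = AutoModelForCausalLM.from_pretrained(DECODER_ID)

# Fill a masked token with the encoder ([MASK] assumed as the mask token).
masked = enc_tokenizer("Paris is the [MASK] of France.", return_tensors="pt")
logits = encoder(**masked).logits
mask_pos = (masked["input_ids"] == enc_tokenizer.mask_token_id).nonzero()[0, 1]
print(enc_tokenizer.decode(logits[0, mask_pos].argmax()))

# Generate a continuation with the decoder.
prompt = dec_tokenizer("The capital of France is", return_tensors="pt")
out = decoder.generate(**prompt, max_new_tokens=10)
print(dec_tokenizer.decode(out[0], skip_special_tokens=True))
```

The point of the pairing is that both halves share a tokenizer, data, and training recipe, so comparisons like the MNLI and generation results quoted above isolate the architecture and objective rather than confounding them with training differences.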