ModernGBERT: German-only 1B Encoder Model Trained from Scratch

May 19, 2025
Authors: Anton Ehrmanntraut, Julia Wunderle, Jan Pfister, Fotis Jannidis, Andreas Hotho
cs.AI

Abstract

Despite the prominence of decoder-only language models, encoders remain crucial for resource-constrained applications. We introduce ModernGBERT (134M, 1B), a fully transparent family of German encoder models trained from scratch, incorporating architectural innovations from ModernBERT. To evaluate the practical trade-offs of training encoders from scratch, we also present LLäMmlein2Vec (120M, 1B, 7B), a family of encoders derived from German decoder-only models via LLM2Vec. We benchmark all models on natural language understanding, text embedding, and long-context reasoning tasks, enabling a controlled comparison between dedicated encoders and converted decoders. Our results show that ModernGBERT 1B outperforms both prior state-of-the-art German encoders and encoders adapted via LLM2Vec in performance and parameter efficiency. All models, training data, checkpoints, and code are publicly available, advancing the German NLP ecosystem with transparent, high-performance encoder models.
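Since the abstract states that all checkpoints are publicly released, a natural use of such an encoder is producing sentence embeddings for retrieval or similarity tasks. The sketch below is a minimal, hedged example: it assumes the checkpoints load through the Hugging Face transformers AutoModel interface, and the hub ID shown is a hypothetical placeholder (the abstract does not give the actual model ID). Mean pooling over non-padding tokens is one common way to turn encoder hidden states into a single sentence vector; it is not necessarily the pooling the authors use.

```python
# Minimal sketch: sentence embeddings from a released German encoder.
# Assumptions (not stated in the abstract): the checkpoint is on the
# Hugging Face Hub under the hypothetical ID below and loads via AutoModel.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "example-org/ModernGBERT-1B"  # hypothetical placeholder ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

sentences = [
    "Berlin ist die Hauptstadt Deutschlands.",
    "Die Hauptstadt von Deutschland ist Berlin.",
]

batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, dim)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two paraphrases (higher = more similar).
sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {sim.item():.3f}")
```

For the LLäMmlein2Vec models, one would instead follow the LLM2Vec conversion recipe (bidirectional attention plus adaptation training) described in the paper; the embedding extraction step afterward is analogous to the pooling shown here.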
