mmBERT: A Modern Multilingual Encoder with Annealed Language Learning
September 8, 2025
Authors: Marc Marone, Orion Weller, William Fleshman, Eugene Yang, Dawn Lawrie, Benjamin Van Durme
cs.AI
Abstract
Encoder-only language models are frequently used for a variety of standard
machine learning tasks, including classification and retrieval. However, there
has been a lack of recent research on encoder models, especially with respect
to multilingual models. We introduce mmBERT, an encoder-only language model
pretrained on 3T tokens of multilingual text in over 1800 languages. To build
mmBERT we introduce several novel elements, including an inverse mask ratio
schedule and an inverse temperature sampling ratio. We add over 1700
low-resource languages to the data mix only during the decay phase, showing
that it boosts performance dramatically and maximizes the gains from the
relatively small amount of training data. Despite only including these
low-resource languages in the short decay phase, we achieve similar
classification performance to models like OpenAI's o3 and Google's Gemini 2.5
Pro. Overall, we show that mmBERT significantly outperforms the previous
generation of models on classification and retrieval tasks -- on both high and
low-resource languages.
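
The abstract names two scheduling ideas, an inverse mask ratio schedule and an inverse temperature sampling ratio, without spelling them out. The sketch below shows one plausible reading, assuming a masking rate that is annealed downward over pretraining and a language-sampling exponent that is annealed so that later phases sample low-resource languages more evenly. The function names, constants, and linear form are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of the two schedules named in the abstract.
# All constants and the linear annealing form are assumptions for illustration.

def mask_ratio(step: int, total_steps: int,
               start: float = 0.30, end: float = 0.05) -> float:
    """Inverse mask ratio schedule: begin with a high masking rate and
    anneal it down as pretraining progresses (assumed linear decay)."""
    frac = min(step / total_steps, 1.0)
    return start + (end - start) * frac


def sampling_temperature(step: int, total_steps: int,
                         start: float = 0.7, end: float = 0.3) -> float:
    """Annealed sampling exponent: lower values later in training flatten the
    language distribution toward low-resource languages (assumed linear)."""
    frac = min(step / total_steps, 1.0)
    return start + (end - start) * frac


def language_sampling_probs(token_counts: dict[str, float],
                            tau: float) -> dict[str, float]:
    """Temperature-based sampling over languages: p_i proportional to c_i ** tau.
    tau = 1 reproduces corpus proportions; tau -> 0 approaches uniform sampling."""
    weights = {lang: count ** tau for lang, count in token_counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


# Example with one high- and one low-resource language (token counts are made up).
counts = {"en": 1_000_000_000, "sw": 10_000_000}
total_steps = 1_000_000
for step in (0, 500_000, 1_000_000):
    tau = sampling_temperature(step, total_steps)
    probs = language_sampling_probs(counts, tau)
    print(f"step={step}: mask={mask_ratio(step, total_steps):.2f}, "
          f"tau={tau:.2f}, p(sw)={probs['sw']:.3f}")
```

Under these assumed schedules, early training masks aggressively and samples mostly high-resource text, while the decay phase masks less and gives the newly added low-resource languages a larger share of the batches, which matches the abstract's description of adding over 1700 low-resource languages only during the decay phase.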