mmBERT: A Modern Multilingual Encoder with Annealed Language Learning

Abstract

Encoder-only language models are frequently used for a variety of standard machine learning tasks, including classification and retrieval. However, there has been little recent research on encoder models, especially multilingual ones. We introduce mmBERT, an encoder-only language model pretrained on 3T tokens of multilingual text in over 1800 languages. To build mmBERT, we introduce several novel elements, including an inverse mask ratio schedule and an inverse temperature sampling ratio. We add over 1700 low-resource languages to the data mix only during the decay phase, showing that this boosts performance dramatically and maximizes the gains from the relatively small amount of training data. Despite including these low-resource languages only in the short decay phase, we achieve classification performance similar to that of models like OpenAI's o3 and Google's Gemini 2.5 Pro. Overall, we show that mmBERT significantly outperforms the previous generation of models on classification and retrieval tasks, for both high- and low-resource languages.
