Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs
February 18, 2025
Authors: Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul, Fajri Koto, Min Si Thu, Hynek Kydlíček, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarnmongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, Taechawat Konkaew, Narong Borijindargoon, Anh Dao, Matichon Maneegard, Phakphum Artkaew, Zheng-Xin Yong, Quan Nguyen, Wannaphong Phatthiyaphaibun, Hoang H. Tran, Mike Zhang, Shiqi Chen, Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, Min Lin
cs.AI
Abstract
Sailor2 is a family of cutting-edge multilingual language models for
South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit
diverse applications. Building on Qwen2.5, Sailor2 undergoes continuous
pre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to
support 13 SEA languages while retaining proficiency in Chinese and English.
The Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA
languages. We also deliver a comprehensive cookbook on how to develop
multilingual models efficiently, covering five key aspects: data curation,
pre-training, post-training, model customization, and evaluation. We hope
that the Sailor2 models (Apache 2.0 license) will drive language development
in the SEA region, and that the Sailor2 cookbook will inspire researchers to
build more inclusive LLMs for other under-served languages.
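Since the models are released under Apache 2.0, they can be loaded with the standard Hugging Face transformers API. Below is a minimal usage sketch, not taken from the paper: the repository ID "sail/Sailor2-20B-Chat" is an assumption based on the naming of the released checkpoints, and the Thai prompt is an illustrative example of one of the 13 supported SEA languages.

```python
# Minimal sketch: running a Sailor2 chat model with Hugging Face transformers.
# The model ID below is assumed from the release naming; the 1B and 8B
# variants would follow the same pattern. Adjust if the actual ID differs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sail/Sailor2-20B-Chat"  # assumed repository name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# A chat prompt in Thai ("Hello, please introduce yourself.").
messages = [{"role": "user", "content": "สวัสดีครับ ช่วยแนะนำตัวหน่อย"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```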