Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs
February 18, 2025
Authors: Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul, Fajri Koto, Min Si Thu, Hynek Kydlíček, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarnmongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, Taechawat Konkaew, Narong Borijindargoon, Anh Dao, Matichon Maneegard, Phakphum Artkaew, Zheng-Xin Yong, Quan Nguyen, Wannaphong Phatthiyaphaibun, Hoang H. Tran, Mike Zhang, Shiqi Chen, Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, Min Lin
cs.AI
Abstract
Sailor2 is a family of cutting-edge multilingual language models for
South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit
diverse applications. Building on Qwen2.5, Sailor2 undergoes continuous
pre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to
support 13 SEA languages while retaining proficiency in Chinese and English.
The Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA
languages. We also deliver a comprehensive cookbook on how to develop
multilingual models efficiently, covering five key aspects: data curation,
pre-training, post-training, model customization, and evaluation. We hope
that the Sailor2 models (Apache 2.0 license) will drive language development
in the SEA region, and that the Sailor2 cookbook will inspire researchers to
build more inclusive LLMs for other under-served languages.
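Since the models are released under Apache 2.0, they can be loaded with the standard Hugging Face transformers API. Below is a minimal usage sketch, not taken from the paper: the repository ID "sail/Sailor2-20B-Chat" is an assumption based on the naming of the released checkpoints, and the Thai prompt is an illustrative example of one of the 13 supported SEA languages.

```python
# Minimal sketch: running a Sailor2 chat model with Hugging Face transformers.
# The model ID below is assumed from the release naming; the 1B and 8B
# variants would follow the same pattern. Adjust if the actual ID differs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sail/Sailor2-20B-Chat"  # assumed repository name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# A chat prompt in Thai ("Hello, please introduce yourself.").
messages = [{"role": "user", "content": "สวัสดีครับ ช่วยแนะนำตัวหน่อย"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```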