SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages

Abstract

Large Language Models (LLMs) have shown remarkable abilities across various tasks, yet their development has predominantly centered on high-resource languages like English and Chinese, leaving low-resource languages underserved. To address this disparity, we present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages. This region, characterized by its rich linguistic diversity, has lacked adequate language technology support. SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese. Leveraging efficient language enhancement techniques and a specially constructed instruction tuning dataset, SeaLLMs 3 significantly reduces training costs while maintaining high performance and versatility. Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models. Additionally, we prioritized safety and reliability by addressing both general and culture-specific considerations and incorporated mechanisms to reduce hallucinations. This work underscores the importance of inclusive AI, showing that advanced LLM capabilities can benefit underserved linguistic and cultural communities.