SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages

Abstract

Large Language Models (LLMs) have shown remarkable abilities across various tasks, yet their development has predominantly centered on high-resource languages like English and Chinese, leaving low-resource languages underserved. To address this disparity, we present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages. This region, characterized by its rich linguistic diversity, has lacked adequate language technology support. SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese. Leveraging efficient language enhancement techniques and a specially constructed instruction tuning dataset, SeaLLMs 3 significantly reduces training costs while maintaining high performance and versatility. Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models. Additionally, we prioritized safety and reliability by addressing both general and culture-specific considerations and incorporated mechanisms to reduce hallucinations. This work underscores the importance of inclusive AI, showing that advanced LLM capabilities can benefit underserved linguistic and cultural communities.
