SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages

Abstract

Large Language Models (LLMs) have shown remarkable abilities across various tasks, yet their development has predominantly centered on high-resource languages like English and Chinese, leaving low-resource languages underserved. To address this disparity, we present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages. This region, characterized by its rich linguistic diversity, has lacked adequate language technology support. SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese. Leveraging efficient language enhancement techniques and a specially constructed instruction tuning dataset, SeaLLMs 3 significantly reduces training costs while maintaining high performance and versatility. Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models. Additionally, we prioritized safety and reliability by addressing both general and culture-specific considerations and incorporated mechanisms to reduce hallucinations. This work underscores the importance of inclusive AI, showing that advanced LLM capabilities can benefit underserved linguistic and cultural communities.
