MaLA-500: Massive Language Adaptation of Large Language Models

Abstract

Large language models have advanced the state of the art in natural language processing. However, because they are designed predominantly for English or a limited set of languages, their effectiveness on low-resource languages lags substantially. To bridge this gap, we introduce MaLA-500, a novel large language model designed to cover an extensive range of 534 languages. To train MaLA-500, we apply vocabulary extension and continued pretraining to LLaMA 2 using Glot500-c. Our experiments on SIB-200 show that MaLA-500 achieves state-of-the-art in-context learning results. We release MaLA-500 at https://huggingface.co/MaLA-LM.
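As a rough illustration of what vocabulary extension involves, the sketch below appends unseen tokens to a toy vocabulary and grows the embedding table by one row per new token. It is not the MaLA-500 implementation: the function name `extend_vocab`, the example tokens, and the mean initialization of new rows are all assumptions made for illustration, since the abstract does not state how new embeddings are initialized.

```python
from typing import Dict, List


def extend_vocab(vocab: Dict[str, int],
                 embeddings: List[List[float]],
                 new_tokens: List[str]) -> None:
    """Append unseen tokens to vocab and add one embedding row per new token.

    New rows are set to the mean of the existing rows. This initialization
    is an assumed choice for the sketch, not taken from the source.
    """
    dim = len(embeddings[0])
    n = len(embeddings)
    # Mean of the original rows, computed once before any row is appended.
    mean_row = [sum(row[j] for row in embeddings) / n for j in range(dim)]
    for tok in new_tokens:
        if tok in vocab:
            continue  # already covered by the base vocabulary
        vocab[tok] = len(embeddings)  # next free token id
        embeddings.append(list(mean_row))


# Toy base model: three tokens with 2-dimensional embeddings.
vocab = {"<s>": 0, "the": 1, "cat": 2}
embeddings = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]

# Hypothetical tokens from new-language text; "the" is already present.
extend_vocab(vocab, embeddings, ["the", "ndi", "yo"])
```

In this sketch the existing token ids and rows are left untouched, so the base model's behavior on its original vocabulary is preserved at the start of continued pretraining, which then has to train the newly added rows.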
