MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions

Abstract

Instruction tuning enhances large language models (LLMs) by aligning them with human preferences across diverse tasks. Traditional approaches to creating instruction tuning datasets face serious challenges for low-resource languages because they depend on data annotation. This work introduces a novel method, Multilingual Reverse Instructions (MURI), which generates high-quality instruction tuning datasets for low-resource languages without requiring human annotators or pre-existing multilingual models. Using reverse instructions and a translation pipeline, MURI produces instruction-output pairs from existing human-written texts in low-resource languages. The method ensures cultural relevance and diversity by sourcing texts from different native domains and applying filters to eliminate inappropriate content. Our dataset, MURI-IT, includes more than 2 million instruction-output pairs across 200 languages. Evaluation by native speakers and fine-tuning experiments with mT5 models demonstrate the approach's effectiveness for both NLU and open-ended generation. We publicly release datasets and models at https://github.com/akoksal/muri.
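
The reverse-instruction idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `build_reverse_instruction_pairs`, the `InstructionPair` type, and the three callables (`translate`, `generate_instruction`, `is_appropriate`) are hypothetical placeholders, and the step ordering (translate the text to English, generate an instruction for it, translate the instruction back) is an assumption about how a translation pipeline could be combined with reverse instructions. The original human-written text is kept unchanged as the output.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class InstructionPair:
    language: str     # language code of the source text (hypothetical field)
    instruction: str  # generated instruction, in the source language
    output: str       # the original human-written text, unchanged


def build_reverse_instruction_pairs(
    documents: Iterable[str],
    language: str,
    translate: Callable[[str, str, str], str],
    generate_instruction: Callable[[str], str],
    is_appropriate: Callable[[str], bool] = lambda text: True,
) -> List[InstructionPair]:
    """Pair each existing text with an instruction it could answer.

    translate(text, source_lang, target_lang) and generate_instruction(text)
    are assumed, caller-supplied components (e.g. an MT system and an
    English-capable LLM); nothing here calls a real library.
    """
    pairs: List[InstructionPair] = []
    for doc in documents:
        # Filter step: skip texts flagged as inappropriate.
        if not is_appropriate(doc):
            continue
        # Assumed step 1: bring the text into English.
        english_doc = translate(doc, language, "en")
        # Assumed step 2: ask for an instruction whose answer is this text.
        english_instruction = generate_instruction(english_doc)
        # Assumed step 3: move the instruction back to the source language.
        instruction = translate(english_instruction, "en", language)
        # The output is the original text, not a translation of it.
        pairs.append(InstructionPair(language, instruction, doc))
    return pairs
```

Keeping the original text as the output is the point of the design: only the instruction passes through translation, so the response side of each pair remains native, human-written content.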
