
MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions

September 19, 2024
Authors: Abdullatif Köksal, Marion Thaler, Ayyoob Imani, Ahmet Üstün, Anna Korhonen, Hinrich Schütze
cs.AI

Abstract

Instruction tuning enhances large language models (LLMs) by aligning them with human preferences across diverse tasks. Traditional approaches to create instruction tuning datasets face serious challenges for low-resource languages due to their dependence on data annotation. This work introduces a novel method, Multilingual Reverse Instructions (MURI), which generates high-quality instruction tuning datasets for low-resource languages without requiring human annotators or pre-existing multilingual models. Utilizing reverse instructions and a translation pipeline, MURI produces instruction-output pairs from existing human-written texts in low-resource languages. This method ensures cultural relevance and diversity by sourcing texts from different native domains and applying filters to eliminate inappropriate content. Our dataset, MURI-IT, includes more than 2 million instruction-output pairs across 200 languages. Evaluation by native speakers and fine-tuning experiments with mT5 models demonstrate the approach's effectiveness for both NLU and open-ended generation. We publicly release datasets and models at https://github.com/akoksal/muri.
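To make the pipeline concrete, here is a minimal Python sketch of the reverse-instructions idea as the abstract describes it: translate a human-written document into English, generate an instruction for which that document is a plausible answer, translate the instruction back into the source language, and keep the untouched original text as the output. The `translate` and `generate_instruction` helpers below are hypothetical stubs standing in for the paper's machine-translation and LLM components, whose actual APIs are not specified here.

```python
from dataclasses import dataclass


@dataclass
class InstructionPair:
    instruction: str  # instruction in the original (low-resource) language
    output: str       # the untouched human-written document


def translate(text: str, src: str, tgt: str) -> str:
    """Placeholder MT call; a real pipeline would invoke a multilingual
    translation model here. Identity stand-in so the sketch runs."""
    return text


def generate_instruction(english_text: str) -> str:
    """Placeholder reverse-instruction step: ask an English-capable LLM
    for an instruction whose ideal answer is `english_text`."""
    return f"Write a passage that covers: {english_text[:60]}..."


def reverse_instruction_pair(doc: str, lang: str) -> InstructionPair:
    # 1. Translate the human-written document into English.
    english = translate(doc, src=lang, tgt="en")
    # 2. Reverse instructions: derive an instruction for which the
    #    document is a plausible answer.
    instruction_en = generate_instruction(english)
    # 3. Translate the instruction back into the source language; the
    #    original text stays as the output, preserving its cultural
    #    and linguistic fidelity.
    instruction = translate(instruction_en, src="en", tgt=lang)
    return InstructionPair(instruction=instruction, output=doc)
```

In this framing, only the short instruction ever passes through translation; the long-form output is never machine-generated, which is what lets the method avoid both human annotators and pre-existing multilingual instruction models.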
