Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning

Abstract

Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in natural language processing (NLP) can be attributed to fine-tuning pre-trained models on a diverse set of tasks, which enables a large language model (LLM) to respond to instructions. Instruction fine-tuning (IFT) requires specifically constructed and annotated datasets. However, existing datasets are almost all in English. In this work, our primary goal is to bridge the language gap by building a human-curated instruction-following dataset spanning 65 languages. We worked with fluent speakers of languages from around the world to collect natural instances of instructions and completions. Furthermore, we create the most extensive multilingual collection to date, comprising 513 million instances, by templating and translating existing datasets across 114 languages. In total, we contribute four key resources: we develop and open-source the Aya Annotation Platform, the Aya Dataset, the Aya Collection, and the Aya Evaluation Suite. The Aya initiative also serves as a case study in participatory research, involving collaborators from 119 countries. We see this as a valuable framework for future research collaborations that aim to bridge gaps in resources.
