Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning

Abstract

Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in natural language processing (NLP) can be attributed to fine-tuning pre-trained models on a diverse set of tasks, which enables a large language model (LLM) to respond to instructions. Instruction fine-tuning (IFT) requires specifically constructed and annotated datasets. However, existing datasets are almost all in English. In this work, our primary goal is to bridge the language gap by building a human-curated instruction-following dataset spanning 65 languages. We worked with fluent speakers of languages from around the world to collect natural instances of instructions and completions. Furthermore, we create the most extensive multilingual collection to date, comprising 513 million instances, by templating and translating existing datasets across 114 languages. In total, we contribute four key resources: we develop and open-source the Aya Annotation Platform, the Aya Dataset, the Aya Collection, and the Aya Evaluation Suite. The Aya initiative also serves as a case study in participatory research, involving collaborators from 119 countries. We see this as a valuable framework for future research collaborations that aim to bridge gaps in resources.

