Scaling Synthetic Data Creation with 1,000,000,000 Personas
June 28, 2024
Authors: Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, Dong Yu
cs.AI
Abstract
We propose a novel persona-driven data synthesis methodology that leverages
various perspectives within a large language model (LLM) to create diverse
synthetic data. To fully exploit this methodology at scale, we introduce
Persona Hub -- a collection of 1 billion diverse personas automatically curated
from web data. These 1 billion personas (~13% of the world's total population),
acting as distributed carriers of world knowledge, can tap into almost every
perspective encapsulated within the LLM, thereby facilitating the creation of
diverse synthetic data at scale for various scenarios. By showcasing Persona
Hub's use cases in synthesizing high-quality mathematical and logical reasoning
problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs
and tools (functions) at scale, we demonstrate that persona-driven data synthesis is
versatile, scalable, flexible, and easy to use, potentially driving a paradigm
shift in synthetic data creation and applications in practice, which may have a
profound impact on LLM research and development.
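The persona-driven idea in the abstract can be illustrated with a minimal sketch: each persona is combined with a task description to form one data synthesis prompt, so a larger and more varied persona pool yields a more varied set of prompts. The template wording, the function name `build_persona_prompts`, and the two example personas are assumptions made for illustration; they are not taken from the paper, and the call to an actual LLM is deliberately left out.

```python
from typing import List

# Hypothetical template: the wording is an assumption for illustration,
# not the prompt used by the Persona Hub authors.
PROMPT_TEMPLATE = "Create a {task} with the following persona:\n{persona}"


def build_persona_prompts(personas: List[str], task: str) -> List[str]:
    """Return one data synthesis prompt per non-empty persona description."""
    return [
        PROMPT_TEMPLATE.format(task=task, persona=persona.strip())
        for persona in personas
        if persona.strip()  # skip blank persona entries
    ]


# Illustrative personas (made up for this sketch, not drawn from Persona Hub).
example_personas = [
    "a moving company driver",
    "a chemical kinetics researcher",
]

prompts = build_persona_prompts(example_personas, "math problem")
for prompt in prompts:
    print(prompt)
    print("---")
```

Each resulting prompt would then be sent to an LLM of the reader's choice. Swapping the `task` argument (for example, "logical reasoning problem" or "user instruction") reuses the same personas for a different kind of synthetic data.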