Scaling Synthetic Data Creation with 1,000,000,000 Personas
June 28, 2024
Authors: Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, Dong Yu
cs.AI
Abstract
We propose a novel persona-driven data synthesis methodology that leverages
various perspectives within a large language model (LLM) to create diverse
synthetic data. To fully exploit this methodology at scale, we introduce
Persona Hub -- a collection of 1 billion diverse personas automatically curated
from web data. These 1 billion personas (~13% of the world's total population),
acting as distributed carriers of world knowledge, can tap into almost every
perspective encapsulated within the LLM, thereby facilitating the creation of
diverse synthetic data at scale for various scenarios. By showcasing Persona
Hub's use cases in synthesizing high-quality mathematical and logical reasoning
problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs
and tools (functions) at scale, we demonstrate persona-driven data synthesis is
versatile, scalable, flexible, and easy to use, potentially driving a paradigm
shift in synthetic data creation and applications in practice, which may have a
profound impact on LLM research and development.
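As a rough illustration of the persona-driven idea described in the abstract, the sketch below pairs each persona description with a data synthesis request to form a prompt that could then be sent to an LLM. The template wording, the names TASK_TEMPLATES, build_prompt, and build_prompts, and the sample personas are all hypothetical and are not taken from the paper; no model is called here.

```python
from typing import Iterable, List

# Hypothetical prompt templates, one per kind of synthetic data.
# The wording is illustrative only and is not the paper's actual prompt text.
TASK_TEMPLATES = {
    "math": "Create a challenging math problem that the following persona might pose.\nPersona: {persona}",
    "logic": "Create a logical reasoning problem that the following persona might pose.\nPersona: {persona}",
    "instruction": "Write a prompt that the following persona might send to an AI assistant.\nPersona: {persona}",
}


def build_prompt(persona: str, task: str) -> str:
    """Combine one persona description with one task template."""
    if task not in TASK_TEMPLATES:
        raise ValueError(f"unknown task: {task!r}")
    return TASK_TEMPLATES[task].format(persona=persona)


def build_prompts(personas: Iterable[str], task: str) -> List[str]:
    """Build one prompt per persona; different personas yield different prompts."""
    return [build_prompt(p, task) for p in personas]


# Two made-up personas standing in for entries of a large persona collection.
example_personas = [
    "a chemical kinetics researcher",
    "a pediatric nurse working night shifts",
]
example_prompts = build_prompts(example_personas, "math")
```

Each string in example_prompts would be passed to whichever LLM is used for synthesis. Because the persona is the only part that varies, the diversity of the resulting data depends on the diversity of the persona collection.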