
From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding

June 4, 2025
Authors: Chiwei Zhu, Benfeng Xu, Xiaorui Wang, Zhendong Mao
cs.AI

Abstract

The pursuit of diverse, complex, and large-scale instruction data is crucial for automatically aligning large language models (LLMs). Existing methods can generate synthetic instructions at scale, but they either suffer from limited grounding sources, which leads to a narrow distribution, or rely on trivial extensions that fail to produce meaningful trajectories in terms of complexity. In contrast, instructions that support efficient alignment are typically crafted with cognitive insights and grounded in real-world use cases. In this paper, we synthesize such instructions using attributed grounding, which involves 1) a top-down attribution process that grounds a selective set of real instructions to situated users, and 2) a bottom-up synthesis process that leverages web documents to first generate a situation and then a meaningful instruction. This framework allows us to harvest diverse and complex instructions at scale by drawing on the vast range of web documents. Specifically, we construct a dataset of 1 million instructions, called SynthQuestions, and demonstrate that models trained on it achieve leading performance on several common benchmarks, with improvements that continue to scale as more web corpora are used. Data, models, and code will be available at https://github.com/Ignoramus0817/SynthQuestions.
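
The two-stage framework described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no prompts, data schema, or model choice, so every name here (`LLM`, `Situation`, `attribute`, `synthesize`, `build_dataset`) and every prompt string is hypothetical. In particular, feeding the attributed seed instructions into the synthesis step as few-shot demonstrations is an assumption made only to connect the two stages.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical type: any function that maps a prompt string to generated text.
LLM = Callable[[str], str]


@dataclass
class Situation:
    """A grounded user context. Field names are illustrative assumptions."""
    document: str     # web document the situation is grounded in
    description: str  # who the user is and why they are asking


def attribute(instruction: str, llm: LLM) -> str:
    """Top-down step: describe a plausible situated user behind a real instruction."""
    prompt = f"Describe the user and situation behind this instruction:\n{instruction}"
    return llm(prompt)


def synthesize(document: str, demos: List[Tuple[str, str]], llm: LLM) -> Tuple[Situation, str]:
    """Bottom-up step: from a web document, generate a situation, then an instruction."""
    # Assumption: attributed (instruction, situation) pairs serve as few-shot examples.
    shots = "\n".join(f"Situation: {s}\nInstruction: {i}" for i, s in demos)
    situation = llm(f"{shots}\nDocument: {document}\nWrite a user situation:")
    instruction = llm(f"{shots}\nSituation: {situation}\nWrite the user's instruction:")
    return Situation(document, situation), instruction


def build_dataset(seeds: List[str], documents: List[str], llm: LLM) -> List[Tuple[Situation, str]]:
    """Attribute each real seed instruction once, then synthesize one instruction per document."""
    demos = [(seed, attribute(seed, llm)) for seed in seeds]
    return [synthesize(doc, demos, llm) for doc in documents]
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client; a stub function that returns canned strings is enough to exercise the control flow.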