

From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding

June 4, 2025
Authors: Chiwei Zhu, Benfeng Xu, Xiaorui Wang, Zhendong Mao
cs.AI

Abstract

The pursuit of diverse, complex, and large-scale instruction data is crucial for automatically aligning large language models (LLMs). While there are methods capable of generating synthetic instructions at scale, they either suffer from limited grounding sources, leading to a narrow distribution, or rely on trivial extensions that fail to produce meaningful trajectories in terms of complexity. In contrast, instructions that benefit efficient alignment are typically crafted with cognitive insights and grounded in real-world use cases. In this paper, we synthesize such instructions using attributed grounding, which involves 1) a top-down attribution process that grounds a selective set of real instructions to situated users, and 2) a bottom-up synthesis process that leverages web documents to first generate a situation, then a meaningful instruction. This framework allows us to harvest diverse and complex instructions at scale, utilizing the vast range of web documents. Specifically, we construct a dataset of 1 million instructions, called SynthQuestions, and demonstrate that models trained on it achieve leading performance on several common benchmarks, with improvements that continually scale with more web corpora. Data, models, and code will be available at https://github.com/Ignoramus0817/SynthQuestions.
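The abstract's two-stage pipeline can be illustrated with a minimal sketch. Everything below is hypothetical: `call_llm` is a stand-in for any instruction-following model API, and the prompts, function names, and dictionary fields are illustrative, not the paper's actual templates.

```python
# Hedged sketch of the attributed-grounding pipeline from the abstract.
# `call_llm` is a placeholder; swap in a real LLM client to run it for real.

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. an OpenAI-compatible endpoint)."""
    return f"[model output for: {prompt[:40]}...]"


def attribute(real_instruction: str) -> dict:
    """Top-down attribution: ground a curated real instruction to a situated
    user by asking who would plausibly issue it, and in what context."""
    situation = call_llm(
        "Describe a realistic user and situation that would produce "
        f"this instruction:\n{real_instruction}"
    )
    return {"instruction": real_instruction, "situation": situation}


def synthesize(web_document: str) -> dict:
    """Bottom-up synthesis: from a web document, first generate a plausible
    user situation, then a meaningful instruction grounded in it."""
    situation = call_llm(
        "Given this document, describe a user situation it could serve:\n"
        f"{web_document}"
    )
    instruction = call_llm(
        f"Write the instruction that user would ask in this situation:\n{situation}"
    )
    return {"situation": situation, "instruction": instruction}


# Scaling the bottom-up step over a large web corpus is what yields the
# diverse, complex instruction set (SynthQuestions in the paper).
corpus = ["<web document 1>", "<web document 2>"]
dataset = [synthesize(doc) for doc in corpus]
```

The design mirrors the abstract's claim that situations mediate between documents and instructions: the intermediate situation is generated first, so each final instruction is anchored to a concrete user context rather than being a trivial rewrite of the source document.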