FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale
January 29, 2026
Authors: Ajay Patel, Colin Raffel, Chris Callison-Burch
cs.AI
Abstract
Due to limited supervised training data, large language models (LLMs) are typically pre-trained via a self-supervised "predict the next word" objective on a vast amount of unstructured text data. To make the resulting model useful to users, it is further trained on a far smaller amount of "instruction-tuning" data comprising supervised training examples of instructions and responses. To overcome the limited amount of supervised data, we propose a procedure that can transform the knowledge in internet-scale pre-training documents into billions of synthetic instruction and answer training pairs. The resulting dataset, called FineInstructions, uses ~18M instruction templates created from real user-written queries and prompts. These instruction templates are matched to and instantiated with human-written source documents from unstructured pre-training corpora. With "supervised" synthetic training data generated at this scale, an LLM can be pre-trained from scratch solely with the instruction-tuning objective, which is far more in-distribution with the expected downstream usage of LLMs (responding to user prompts). We conduct controlled token-for-token training experiments and find that pre-training on FineInstructions outperforms standard pre-training and other proposed synthetic pre-training techniques on standard benchmarks measuring free-form response quality. Our resources can be found at https://huggingface.co/fineinstructions.
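The matching-and-instantiation step described above can be illustrated with a minimal sketch. Note this is an assumption-laden toy: the template format, the keyword-overlap relevance score, the threshold, and the helper names (`score`, `match_and_instantiate`) are all illustrative inventions, not the paper's actual pipeline, which operates at internet scale on ~18M templates.

```python
# Toy sketch of matching instruction templates to source documents and
# instantiating them. In the real dataset, the answer side of each pair is
# also derived from the source document; here we only produce instructions
# paired with their grounding document.

def score(template_keywords, document):
    """Toy relevance score: fraction of template keywords found in the document."""
    doc_tokens = set(document.lower().split())
    hits = sum(1 for kw in template_keywords if kw in doc_tokens)
    return hits / max(len(template_keywords), 1)

def match_and_instantiate(templates, documents, threshold=0.5):
    """Instantiate every template whose relevance to a document clears the threshold."""
    pairs = []
    for doc in documents:
        for tmpl in templates:
            if score(tmpl["keywords"], doc) >= threshold:
                instruction = tmpl["text"].format(document=doc)
                pairs.append({"instruction": instruction, "source": doc})
    return pairs

templates = [
    {"text": "Summarize the following passage:\n{document}",
     "keywords": ["study", "results"]},
    {"text": "Explain the training method described here:\n{document}",
     "keywords": ["training", "objective"]},
]
documents = [
    "The study reports results on a new pre-training objective for language models.",
]
pairs = match_and_instantiate(templates, documents)
```

In practice a learned retrieval or embedding-similarity model would replace the keyword-overlap score, but the overall shape (score, filter, fill the template's document slot) is the same.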