
AgentInstruct: Toward Generative Teaching with Agentic Flows

July 3, 2024
Authors: Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, Ahmed Awadallah
cs.AI

Abstract

Synthetic data is becoming increasingly important for accelerating the development of language models, both large and small. Despite several successful use cases, researchers have also raised concerns around model collapse and the drawbacks of imitating other models. This discrepancy can be attributed to the fact that synthetic data varies in quality and diversity. Effective use of synthetic data usually requires significant human effort in curating the data. We focus on using synthetic data for post-training, specifically having powerful models create data that teaches a new skill or behavior to another model; we refer to this setting as Generative Teaching. We introduce AgentInstruct, an extensible agentic framework for automatically creating large amounts of diverse and high-quality synthetic data. AgentInstruct can create both the prompts and the responses, using only raw data sources such as text documents and code files as seeds. We demonstrate the utility of AgentInstruct by creating a post-training dataset of 25M pairs to teach language models different skills, such as text editing, creative writing, tool usage, coding, and reading comprehension. The dataset can be used for instruction tuning of any base model. We post-train Mistral-7b with the data. When comparing the resulting model, Orca-3, to Mistral-7b-Instruct (which uses the same base model), we observe significant improvements across many benchmarks: for example, a 40% improvement on AGIEval, 19% on MMLU, 54% on GSM8K, 38% on BBH, and 45% on AlpacaEval. Additionally, it consistently outperforms other models such as LLAMA-8B-instruct and GPT-3.5-turbo.
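
For readers who want a concrete picture of the generative-teaching setting, the minimal sketch below is purely illustrative and not taken from the paper: names such as `call_teacher_model`, `generate_pairs`, and `InstructionPair` are hypothetical stand-ins. It shows the basic loop the abstract describes, in which a strong teacher model turns a raw seed document into (instruction, response) pairs that could later be used to instruction-tune a student model.

```python
# Illustrative sketch of "generative teaching" with synthetic data
# (hypothetical names; not the authors' AgentInstruct implementation).

from dataclasses import dataclass


@dataclass
class InstructionPair:
    instruction: str
    response: str


def call_teacher_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to a strong teacher LLM;
    replace with a real completion API in practice."""
    return f"<teacher output for: {prompt[:40]}...>"


def generate_pairs(seed_document: str, num_pairs: int = 3) -> list[InstructionPair]:
    """Create synthetic instruction-tuning pairs from a raw seed document."""
    pairs = []
    for i in range(num_pairs):
        # Step 1: propose a task (e.g. editing, QA, summarization) grounded in the seed text.
        instruction = call_teacher_model(
            f"Read the text below and write task #{i + 1} that a student model "
            f"should learn to perform:\n{seed_document}"
        )
        # Step 2: produce a high-quality answer to that task.
        response = call_teacher_model(
            f"Complete the task using the source text.\n"
            f"Task: {instruction}\nSource: {seed_document}"
        )
        pairs.append(InstructionPair(instruction, response))
    return pairs


if __name__ == "__main__":
    seed = "Transformers process tokens in parallel using self-attention."
    for pair in generate_pairs(seed, num_pairs=2):
        print(pair.instruction, "->", pair.response)
```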
