AgentInstruct: Toward Generative Teaching with Agentic Flows
July 3, 2024
Authors: Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, Ahmed Awadallah
cs.AI
Abstract
Synthetic data is becoming increasingly important for accelerating the
development of language models, both large and small. Despite several
successful use cases, researchers have also raised concerns about model
collapse and the drawbacks of imitating other models. This discrepancy can be attributed to
the fact that synthetic data varies in quality and diversity. Effective use of
synthetic data usually requires significant human effort in curating the data.
We focus on using synthetic data for post-training, specifically on using
powerful models to create data that teaches a new skill or behavior to another
model; we refer to this setting as Generative Teaching. We introduce AgentInstruct, an
extensible agentic framework for automatically creating large amounts of
diverse and high-quality synthetic data. AgentInstruct can create both the
prompts and responses, using only raw data sources like text documents and code
files as seeds. We demonstrate the utility of AgentInstruct by creating a
post-training dataset of 25M pairs to teach language models different skills, such
as text editing, creative writing, tool usage, coding, reading comprehension,
etc. The dataset can be used for instruction tuning of any base model. We
post-train Mistral-7b with the data. When comparing the resulting model Orca-3
to Mistral-7b-Instruct (which uses the same base model), we observe significant
improvements across many benchmarks. For example, we see a 40% improvement on
AGIEval, 19% on MMLU, 54% on GSM8K, 38% on BBH, and 45% on AlpacaEval.
Additionally, Orca-3 consistently outperforms other models such as
LLAMA-8B-instruct and GPT-3.5-turbo.
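The generative-teaching pipeline the abstract describes (raw seed documents transformed by a flow of agents into prompt/response pairs) can be sketched schematically. The stage names, the `run_flow` function, and the toy stand-in agents below are illustrative assumptions for this sketch, not AgentInstruct's actual implementation; in the real system each agent would be backed by an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class InstructionPair:
    prompt: str
    response: str

# An "agent" here is just a text-to-text function; in practice it would
# wrap a powerful model. The flow has three illustrative stages:
#   1. transformers reshape the raw seed into intermediate content,
#   2. instruction_creators derive a prompt from that content,
#   3. refiners add constraints or complexity to the prompt.
Agent = Callable[[str], str]

def run_flow(seed_text: str,
             transformers: List[Agent],
             instruction_creators: List[Agent],
             refiners: List[Agent]) -> List[InstructionPair]:
    pairs: List[InstructionPair] = []
    for transform in transformers:
        intermediate = transform(seed_text)
        for create in instruction_creators:
            prompt = create(intermediate)
            for refine in refiners:
                refined = refine(prompt)
                # Response generation (another model call) is stubbed out.
                pairs.append(InstructionPair(refined, f"<response to: {refined}>"))
    return pairs

# Toy stand-ins for LLM-backed agents, to keep the sketch runnable.
summarize = lambda doc: doc[:40]
ask_question = lambda text: f"Answer based on the passage: {text}"
add_constraint = lambda p: p + " Respond in one sentence."

pairs = run_flow("Synthetic data accelerates language model development.",
                 [summarize], [ask_question], [add_constraint])
print(len(pairs), pairs[0].prompt)
```

Because every agent in each stage is applied combinatorially, a small number of transformers, creators, and refiners fans out into many diverse prompts per seed, which is how a flow like this can scale to tens of millions of pairs from raw documents.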