AgentInstruct: Toward Generative Teaching with Agentic Flows

Abstract

Synthetic data is becoming increasingly important for accelerating the development of language models, both large and small. Despite several successful use cases, researchers have also raised concerns about model collapse and the drawbacks of imitating other models. This discrepancy can be attributed to the fact that synthetic data varies widely in quality and diversity. Effective use of synthetic data usually requires significant human effort in curating the data. We focus on using synthetic data for post-training, specifically on data created by powerful models to teach a new skill or behavior to another model. We refer to this setting as Generative Teaching. We introduce AgentInstruct, an extensible agentic framework for automatically creating large amounts of diverse and high-quality synthetic data. AgentInstruct can create both the prompts and the responses, using only raw data sources such as text documents and code files as seeds. We demonstrate the utility of AgentInstruct by creating a post-training dataset of 25M pairs to teach language models different skills, such as text editing, creative writing, tool usage, coding, and reading comprehension. The dataset can be used for instruction tuning of any base model. We post-train Mistral-7b with the data. When comparing the resulting model, Orca-3, to Mistral-7b-Instruct (which uses the same base model), we observe significant improvements across many benchmarks: a 40% improvement on AGIEval, 19% on MMLU, 54% on GSM8K, 38% on BBH, and 45% on AlpacaEval. Additionally, Orca-3 consistently outperforms other models such as LLAMA-8B-instruct and GPT-3.5-turbo.
