MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation

Abstract

Recent smaller language models such as Phi-3.5 and Phi-4 rely on synthetic data generated using larger language models. Questions remain about leveraging synthetic data for other use cases, such as adapting LLMs to specific domains. A key limitation of synthetic data is low diversity, which negatively impacts its downstream applicability for improving other models. To address this, we propose MetaSynth, a method for generating synthetic data that enhances diversity through meta-prompting, where a language model orchestrates multiple "expert" LLM agents to collaboratively generate data. Using only 25 million tokens of synthetic data generated with MetaSynth, we successfully adapt a well-trained LLM (Mistral-7B-v0.3) to two specialized domains, Finance and Biomedicine, without compromising the capabilities of the resulting model on general tasks. In addition, we evaluate the diversity of our synthetic data using seven automated metrics and find that it approaches the diversity of LLM pre-training corpora. Continually pre-training Mistral-7B-v0.3 on MetaSynth data yields a model that notably outperforms the base LLM, with improvements of up to 4.08% in Finance and 13.75% in Biomedicine. The same model shows degraded performance when trained on data generated using a template prompt, even when the template includes prior generations and varying in-context exemplars of real data. Our findings suggest that a few million tokens of diverse synthetic data, without mixing in any real data, are sufficient for effective domain adaptation when using MetaSynth.
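The abstract does not name the seven automated diversity metrics. Purely as an illustration of what one such metric can look like, the sketch below computes distinct-n, the ratio of unique n-grams to total n-grams in a corpus. The choice of distinct-n, the function name `distinct_n`, whitespace tokenization, and the toy sentences are all assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across a list of texts.

    Illustrative only: uses lowercase whitespace tokens, and is not
    claimed to be one of the seven metrics used in the paper.
    """
    ngrams = Counter()
    for text in texts:
        tokens = text.lower().split()
        # Slide a window of size n over the token sequence.
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    # An empty corpus (or texts shorter than n tokens) has no n-grams.
    return len(ngrams) / total if total else 0.0


# Hypothetical toy corpora: near-duplicate "templated" output vs. varied output.
templated = [
    "the bank raised rates today",
    "the bank raised rates today",
    "the bank raised rates again",
]
varied = [
    "the bank raised rates today",
    "bond yields fell after the auction",
    "analysts expect slower loan growth",
]

if __name__ == "__main__":
    print("templated:", round(distinct_n(templated), 3))
    print("varied:", round(distinct_n(varied), 3))
```

A higher score means fewer repeated n-grams, so the near-duplicate corpus scores lower than the varied one. A single lexical metric like this captures only surface repetition, which is presumably one reason the authors report several metrics rather than one.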
