Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

Abstract

We introduce Generalized Instruction Tuning (GLAN), a general and scalable method for instruction tuning of Large Language Models (LLMs). Unlike prior work that relies on seed examples or existing datasets to construct instruction tuning data, GLAN takes only a pre-curated taxonomy of human knowledge and capabilities as input and generates large-scale synthetic instruction data across all disciplines. Specifically, inspired by the systematic structure of the human education system, we build the taxonomy semi-automatically, with the help of LLMs, by decomposing human knowledge and capabilities into fields, sub-fields, and ultimately distinct disciplines. We then generate a comprehensive list of subjects for every discipline and design a syllabus tailored to each subject, again using LLMs. With the fine-grained key concepts detailed in every class session of the syllabus, we are able to generate diverse instructions with broad coverage across the entire spectrum of human knowledge and skills. Extensive experiments on large language models (e.g., Mistral) demonstrate that GLAN excels in multiple dimensions, from mathematical reasoning, coding, academic exams, and logical reasoning to general instruction following, without using task-specific training data for these tasks. In addition, GLAN allows for easy customization: new fields or skills can be added by simply incorporating a new node into the taxonomy.
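The pipeline described in the abstract (discipline, then subjects, then a syllabus of class sessions with key concepts, then instructions) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable interface, the prompt wording, the `stub_llm` stand-in, and the choice to build one instruction per pair of key concepts are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Tuple

# Hypothetical LLM interface (an assumption): prompt in, list of strings out.
LLM = Callable[[str], List[str]]


@dataclass
class Instruction:
    discipline: str
    subject: str
    session: str
    concepts: Tuple[str, ...]
    prompt: str


def generate_instructions(
    disciplines: List[str], llm: LLM, concepts_per_instruction: int = 2
) -> List[Instruction]:
    """Walk taxonomy leaves -> subjects -> sessions -> key concepts -> instructions."""
    out: List[Instruction] = []
    for discipline in disciplines:
        for subject in llm(f"List subjects taught in {discipline}"):
            for session in llm(f"List class sessions in a syllabus for {subject}"):
                concepts = llm(f"List key concepts for {session} of {subject}")
                # Assumption: one instruction per combination of key concepts.
                for combo in combinations(concepts, concepts_per_instruction):
                    text = llm(f"Write an instruction covering: {', '.join(combo)}")[0]
                    out.append(Instruction(discipline, subject, session, combo, text))
    return out


def stub_llm(prompt: str) -> List[str]:
    """Deterministic stand-in for a real model, used only to make the sketch runnable."""
    if prompt.startswith("List subjects"):
        return ["Subject A", "Subject B"]
    if prompt.startswith("List class sessions"):
        return ["Session 1", "Session 2"]
    if prompt.startswith("List key concepts"):
        return ["concept x", "concept y", "concept z"]
    return [f"Q: {prompt}"]


data = generate_instructions(["Mathematics"], stub_llm)
print(len(data))  # 12 with the stub above: 2 subjects x 2 sessions x 3 concept pairs
```

In this sketch, adding a new field or skill amounts to appending another entry to the `disciplines` list, which mirrors the customization property claimed in the abstract.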
