Only-IF: Revealing the Decisive Effect of Instruction Diversity on Generalization

Abstract

Understanding and accurately following instructions is critical for large language models (LLMs) to be effective across diverse tasks. In this work, we rigorously examine the key factors that enable models to generalize to unseen instructions, providing insights to guide the collection of data for instruction-tuning. Through controlled experiments inspired by the Turing-complete Markov algorithm, we demonstrate that such generalization emerges only when training data is sufficiently diversified across semantic domains. Our findings also reveal that merely diversifying within limited domains fails to ensure robust generalization. In contrast, cross-domain data diversification, even under constrained data budgets, significantly enhances a model's adaptability. We further extend our analysis to real-world scenarios, including the fine-tuning of $\textbf{specialist}$ and $\textbf{generalist}$ models. In both cases, we demonstrate that 1) better performance can be achieved by increasing the diversity of an established dataset while keeping the data size constant, and 2) when scaling up the data, diversifying the semantics of instructions is more effective than simply increasing the quantity of similar data. Our research provides important insights for dataset collation, particularly when optimizing model performance by expanding training data for both specialist and generalist scenarios. We show that careful consideration of data diversification is key: training specialist models with data extending beyond their core domain leads to significant performance improvements, while generalist models benefit from diverse data mixtures that enhance their overall instruction-following capabilities across a wide range of applications. Our results highlight the critical role of strategic diversification and offer clear guidelines for improving data quality.
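
The abstract does not describe the experimental setup in detail. As background only: a Markov algorithm is an ordered list of string-rewriting rules, where each step applies the first rule whose pattern occurs in the string, rewrites the leftmost occurrence, and restarts from the top of the list until no rule matches or a terminating rule fires. The Python sketch below illustrates that formalism and is not the authors' training or evaluation code; the function name `run_markov`, the `(pattern, replacement, terminating)` tuple layout, and the `max_steps` guard are assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical rule layout: (pattern, replacement, is_terminating).
Rule = Tuple[str, str, bool]


def run_markov(rules: List[Rule], text: str, max_steps: int = 1000) -> str:
    """Apply ordered rewrite rules in the style of a Markov algorithm."""
    for _ in range(max_steps):
        for pattern, replacement, terminating in rules:
            if pattern in text:
                # Rewrite only the leftmost occurrence of the first matching rule.
                text = text.replace(pattern, replacement, 1)
                if terminating:
                    return text
                break  # restart scanning from the first rule
        else:
            # No rule matched, so the algorithm halts.
            return text
    # Guard against rule sets that never halt (an assumption of this sketch).
    raise RuntimeError("step limit reached without halting")


# Classic textbook rule set: convert a binary number to unary tally marks.
binary_to_unary: List[Rule] = [
    ("|0", "0||", False),
    ("1", "0|", False),
    ("0", "", False),
]
print(run_markov(binary_to_unary, "101"))  # |||||
```

Each rule can be read as a small instruction of the form "replace X with Y", which is one way to see why a rewriting system of this kind lends itself to controlled studies of instruction following.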
