トレーシング・ザ・ルーツ：学習後大規模言語モデルにおけるデータ系統を解明するマルチエージェントフレームワーク

要旨

ポストトレーニングデータは大規模言語モデル（LLM）の能力形成において極めて重要な役割を果たすが、データセットは往々にして孤立した成果物として扱われ、その進化を支える体系的な連関が見落とされがちである。これらの複雑な関係性を解きほぐすため、我々はLLMエコシステムにデータ系統の概念を導入し、データセット開発の進化グラフを再構築する自動化マルチエージェントフレームワークを提案する。大規模な系統分析を通じて、数学指向データセットにおける垂直的な精緻化や一般領域コーパスにおける水平的な集約など、領域特有の構造的パターンを明らかにする。さらに、暗黙的なデータセット交差に起因する構造的冗長性や、系統経路に沿ったベンチマーク汚染の伝播といった、広範に存在する体系的問題を発見した。データ構築における系統分析の実用的価値を示すため、再構築した系統グラフを活用して、系統を意識した多様性指向のデータセットを構築する。上流のルートソースを指令サンプリングの拠点とすることで、下流での均質化や潜在的な冗長性を緩和し、より多様性に富むポストトレーニングコーパスを生成する手法を提案する。さらに大規模データエコシステムにおいて、サンプル単位のデータセット比較に代わる効率的かつ頑健な位相論的代替手法として、系統中心の分析の有効性を強調する。明示的な系統構造に基づいたデータ構築により、本研究はポストトレーニングデータの管理をより体系的で制御可能なパラダイムへと推進するものである。

English

Post-training data plays a pivotal role in shaping the capabilities of Large Language Models (LLMs), yet datasets are often treated as isolated artifacts, overlooking the systemic connections that underlie their evolution. To disentangle these complex relationships, we introduce the concept of data lineage to the LLM ecosystem and propose an automated multi-agent framework to reconstruct the evolutionary graph of dataset development. Through large-scale lineage analysis, we characterize domain-specific structural patterns, such as vertical refinement in math-oriented datasets and horizontal aggregation in general-domain corpora. Moreover, we uncover pervasive systemic issues, including structural redundancy induced by implicit dataset intersections and the propagation of benchmark contamination along lineage paths. To demonstrate the practical value of lineage analysis for data construction, we leverage the reconstructed lineage graph to create a lineage-aware diversity-oriented dataset. By anchoring instruction sampling at upstream root sources, this approach mitigates downstream homogenization and hidden redundancy, yielding a more diverse post-training corpus. We further highlight lineage-centric analysis as an efficient and robust topological alternative to sample-level dataset comparison for large-scale data ecosystems. By grounding data construction in explicit lineage structures, our work advances post-training data curation toward a more systematic and controllable paradigm.

トレーシング・ザ・ルーツ：学習後大規模言語モデルにおけるデータ系統を解明するマルチエージェントフレームワーク

Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs

要旨

Support