Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs

Abstract

Post-training data plays a pivotal role in shaping the capabilities of Large Language Models (LLMs), yet datasets are often treated as isolated artifacts, overlooking the systemic connections that underlie their evolution. To disentangle these complex relationships, we introduce the concept of data lineage to the LLM ecosystem and propose an automated multi-agent framework to reconstruct the evolutionary graph of dataset development. Through large-scale lineage analysis, we characterize domain-specific structural patterns, such as vertical refinement in math-oriented datasets and horizontal aggregation in general-domain corpora. Moreover, we uncover pervasive systemic issues, including structural redundancy induced by implicit dataset intersections and the propagation of benchmark contamination along lineage paths. To demonstrate the practical value of lineage analysis for data construction, we leverage the reconstructed lineage graph to create a lineage-aware, diversity-oriented dataset. By anchoring instruction sampling at upstream root sources, this approach mitigates downstream homogenization and hidden redundancy, yielding a more diverse post-training corpus. We further highlight lineage-centric analysis as an efficient and robust topological alternative to sample-level dataset comparison for large-scale data ecosystems. By grounding data construction in explicit lineage structures, our work advances post-training data curation toward a more systematic and controllable paradigm.
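The idea of anchoring sampling at upstream roots can be illustrated with a minimal sketch. It is not the paper's implementation: the graph contents, the dataset names, and the helpers `root_sources`, `shared_roots`, and `sample_from_roots` are all hypothetical, and the lineage graph is assumed to be a simple mapping from each dataset to the datasets it was derived from.

```python
import random

# Hypothetical lineage graph (an assumption for illustration):
# each dataset maps to the upstream datasets it was derived from.
LINEAGE = {
    "root_a": [],
    "root_b": [],
    "refined_a": ["root_a"],              # vertical refinement of one source
    "mix_ab": ["refined_a", "root_b"],    # horizontal aggregation of two sources
    "mix_ab_v2": ["mix_ab", "root_a"],    # re-adds root_a, which it already inherits
}


def root_sources(graph, dataset):
    """Return the upstream roots (datasets with no parents) reachable from `dataset`."""
    roots, seen, stack = set(), set(), [dataset]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        parents = graph.get(node, [])  # unknown datasets are treated as roots
        if not parents:
            roots.add(node)
        stack.extend(parents)
    return roots


def shared_roots(graph, a, b):
    """Roots that two datasets both descend from: a hint of implicit overlap."""
    return root_sources(graph, a) & root_sources(graph, b)


def sample_from_roots(graph, dataset, pools, per_root, seed=0):
    """Draw up to `per_root` instructions from each root of `dataset`.

    `pools` maps a root name to its list of instructions (hypothetical data).
    """
    rng = random.Random(seed)
    picked = []
    for root in sorted(root_sources(graph, dataset)):
        pool = pools[root]
        picked.extend(rng.sample(pool, min(per_root, len(pool))))
    return picked
```

Sampling per root rather than per downstream dataset means a source that reaches a corpus through several paths (here `root_a` in `mix_ab_v2`) is counted once, which is the kind of hidden redundancy the abstract describes. Under the same assumption, propagation of benchmark contamination could be checked by testing whether any ancestor of a dataset is flagged.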
