Tracing to the Source: A Multi-Agent Framework for Tracking the Data Lineage of Post-Training Large Language Models

Abstract

Post-training data plays a pivotal role in shaping the capabilities of Large Language Models (LLMs), yet datasets are often treated as isolated artifacts, overlooking the systemic connections that underlie their evolution. To disentangle these complex relationships, we introduce the concept of data lineage to the LLM ecosystem and propose an automated multi-agent framework to reconstruct the evolutionary graph of dataset development. Through large-scale lineage analysis, we characterize domain-specific structural patterns, such as vertical refinement in math-oriented datasets and horizontal aggregation in general-domain corpora. Moreover, we uncover pervasive systemic issues, including structural redundancy induced by implicit dataset intersections and the propagation of benchmark contamination along lineage paths. To demonstrate the practical value of lineage analysis for data construction, we leverage the reconstructed lineage graph to create a lineage-aware diversity-oriented dataset. By anchoring instruction sampling at upstream root sources, this approach mitigates downstream homogenization and hidden redundancy, yielding a more diverse post-training corpus. We further highlight lineage-centric analysis as an efficient and robust topological alternative to sample-level dataset comparison for large-scale data ecosystems. By grounding data construction in explicit lineage structures, our work advances post-training data curation toward a more systematic and controllable paradigm.