What Models Are Our Models Built On? Examining the Hidden Dependencies in Modern Large Language Models

Abstract

Modern LLM training pipelines increasingly rely on other models to generate data, filter corpora, judge outputs, and guide development decisions. These dependencies are recursive: a model may depend on an upstream artifact whose own dependencies are documented only in separate releases and artifacts. As a result, the full dependency structure is fragmented across heterogeneous public artifacts, with complexity and recursive depth far outpacing humans' ability to trace. We introduce ModSleuth, an agentic system that recursively reconstructs LLM dependency graphs from public artifacts with source-grounded evidence. We find that the primary challenge is no longer information extraction, but defining what constitutes a dependency and reconciling artifact references across inconsistent documentation. We address these challenges through a formalization that distinguishes direct and indirect dependencies, represents heterogeneous pipeline roles through operation-centered relationships, and resolves artifact identities across names, versions, and repositories. Applying ModSleuth to four public-artifact-rich LLM releases, we recover 1,060 source-verified dependencies and construct large-scale dependency graphs of modern LLM development. These graphs reveal multi-hop license obligations, train-evaluation coupling, discrepancies between released and training-time artifacts, and documentation inconsistencies that would otherwise be difficult to uncover. We release ModSleuth and the resulting dependency graphs to support transparent analysis of the increasingly complex ecosystems underlying modern LLMs.
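To make the distinction between direct and indirect dependencies concrete, the following is a minimal sketch, not ModSleuth's actual implementation. It assumes a simple in-memory edge list; the artifact names, operation labels, alias table, and license string are all hypothetical and chosen only for illustration. Each edge records the operation that links two artifacts, names are normalized through an alias table as a stand-in for identity resolution, and indirect dependencies are those reachable only through two or more hops.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    source: str     # artifact that depends on something
    target: str     # upstream artifact it depends on
    operation: str  # pipeline role of the upstream artifact (hypothetical labels)


def canonical(name: str, aliases: dict[str, str]) -> str:
    """Map a raw artifact reference to one canonical identifier."""
    key = name.strip().lower()
    return aliases.get(key, key)


def build_graph(deps, aliases):
    """Build an adjacency map: artifact -> set of (upstream, operation)."""
    graph: dict[str, set[tuple[str, str]]] = {}
    for d in deps:
        src = canonical(d.source, aliases)
        tgt = canonical(d.target, aliases)
        graph.setdefault(src, set()).add((tgt, d.operation))
        graph.setdefault(tgt, set())
    return graph


def indirect_dependencies(graph, root):
    """Return artifacts reachable from root only via two or more hops."""
    direct = {t for t, _ in graph.get(root, set())}
    seen: set[str] = set()  # artifacts reached at depth >= 2
    queue = deque(direct)
    while queue:
        node = queue.popleft()
        for t, _ in graph.get(node, set()):
            if t not in seen and t != root:
                seen.add(t)
                queue.append(t)
    return seen - direct


# Hypothetical records, as if extracted from separate public documents.
deps = [
    Dependency("model-a", "corpus-c", "trained_on"),
    Dependency("corpus-c", "Teacher-D", "generated_by"),
    Dependency("model-a", "judge-b", "judged_by"),
    Dependency("teacher-d-v1", "corpus-e", "trained_on"),
]
# Two names that refer to the same artifact in inconsistent documentation.
aliases = {"teacher-d-v1": "teacher-d"}
# Hypothetical license table for upstream artifacts.
licenses = {"corpus-e": "hypothetical-noncommercial"}

graph = build_graph(deps, aliases)
indirect = indirect_dependencies(graph, "model-a")
inherited = {licenses[a] for a in indirect if a in licenses}
print(sorted(indirect), sorted(inherited))
```

In this toy graph, the license attached to corpus-e reaches model-a only through a three-hop chain, and that chain is connected only because the two spellings of teacher-d are resolved to one identifier. Without the alias entry the traversal would stop at teacher-d and the obligation would go unnoticed.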