What Models Are Our Models Built On? Examining Hidden Dependencies in Modern Large Language Models

Abstract

Modern LLM training pipelines increasingly rely on other models to generate data, filter corpora, judge outputs, and guide development decisions. These dependencies are recursive: a model may depend on an upstream artifact whose own dependencies are documented only in separate releases and artifacts. As a result, the full dependency structure is fragmented across heterogeneous public artifacts, and its complexity and recursive depth far exceed what humans can trace manually. We introduce ModSleuth, an agentic system that recursively reconstructs LLM dependency graphs from public artifacts with source-grounded evidence. We find that the primary challenge is no longer information extraction, but defining what constitutes a dependency and reconciling artifact references across inconsistent documentation. We address these challenges through a formalization that distinguishes direct and indirect dependencies, represents heterogeneous pipeline roles through operation-centered relationships, and resolves artifact identities across names, versions, and repositories. Applying ModSleuth to four public-artifact-rich LLM releases, we recover 1,060 source-verified dependencies and construct large-scale dependency graphs of modern LLM development. These graphs reveal multi-hop license obligations, training-evaluation coupling, discrepancies between released and training-time artifacts, and documentation inconsistencies that would otherwise be difficult to uncover. We release ModSleuth and the resulting dependency graphs to support transparent analysis of the increasingly complex ecosystems underlying modern LLMs.
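
To make the distinction between direct and indirect dependencies concrete, the following is a minimal sketch and not ModSleuth's actual implementation. The `Dependency` record, the operation labels, the artifact names, and the `upstream_hops` helper are all hypothetical, introduced only for illustration. It assumes that a direct dependency is an upstream artifact one hop away and an indirect dependency is one reachable only through two or more hops.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    """One operation-centered edge: `target` was produced using `source`."""
    target: str     # downstream artifact (hypothetical name)
    source: str     # upstream artifact it relies on (hypothetical name)
    operation: str  # pipeline role, e.g. "train", "generate", "filter"


def upstream_hops(edges, artifact):
    """Breadth-first walk upstream from `artifact`.

    Returns {upstream artifact: minimum hop count}. Already-visited
    artifacts are skipped, so cyclic references terminate.
    """
    parents = {}
    for e in edges:
        parents.setdefault(e.target, set()).add(e.source)

    hops = {}
    queue = deque([(artifact, 0)])
    while queue:
        node, depth = queue.popleft()
        for parent in sorted(parents.get(node, ())):
            if parent not in hops and parent != artifact:
                hops[parent] = depth + 1
                queue.append((parent, depth + 1))
    return hops


# Toy graph with made-up artifacts: a model trained on a dataset that was
# generated by one model and filtered by another.
EDGES = [
    Dependency("model-a", "dataset-x", "train"),
    Dependency("dataset-x", "model-b", "generate"),
    Dependency("dataset-x", "model-c", "filter"),
    Dependency("model-b", "dataset-y", "train"),
]

hops = upstream_hops(EDGES, "model-a")
direct = {name for name, h in hops.items() if h == 1}
indirect = {name for name, h in hops.items() if h >= 2}
```

In this toy graph, `model-a` depends directly only on `dataset-x`, while `model-b`, `model-c`, and `dataset-y` surface as indirect dependencies. A multi-hop license obligation of the kind the abstract mentions would, under these assumptions, correspond to a license attached to an artifact in the `indirect` set.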