Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

Abstract

Modern LLM training pipelines increasingly rely on other models to generate data, filter corpora, judge outputs, and guide development decisions. These dependencies are recursive: a model may depend on an upstream artifact whose own dependencies are documented only in separate releases and artifacts. As a result, the full dependency structure is fragmented across heterogeneous public artifacts, and its complexity and recursive depth far exceed humans' ability to trace it. We introduce ModSleuth, an agentic system that recursively reconstructs LLM dependency graphs from public artifacts with source-grounded evidence. We find that the primary challenge is no longer information extraction, but defining what constitutes a dependency and reconciling artifact references across inconsistent documentation. We address these challenges through a formalization that distinguishes direct from indirect dependencies, represents heterogeneous pipeline roles through operation-centered relationships, and resolves artifact identities across names, versions, and repositories. Applying ModSleuth to four LLM releases that are rich in public artifacts, we recover 1,060 source-verified dependencies and construct large-scale dependency graphs of modern LLM development. These graphs reveal multi-hop license obligations, train-evaluation coupling, discrepancies between released and training-time artifacts, and documentation inconsistencies that would otherwise be difficult to uncover. We release ModSleuth and the resulting dependency graphs to support transparent analysis of the increasingly complex ecosystems underlying modern LLMs.
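
To make the formalization more concrete, the following is a minimal sketch of a dependency graph with operation-labeled edges, alias resolution, and a split between direct and indirect dependencies. It is not ModSleuth's implementation. The artifact names, operation labels, alias table, and function names are all hypothetical, and the sketch assumes that an indirect dependency is an artifact reachable only through more than one hop.

```python
from collections import defaultdict

# Hypothetical alias table: maps inconsistent references to one canonical id.
ALIASES = {
    "judge-v1": "org/judge-model@1.0",
    "Judge Model 1.0": "org/judge-model@1.0",
}


def canonical(name):
    """Resolve an artifact reference to its canonical id (unchanged if unknown)."""
    return ALIASES.get(name, name)


def build_graph(edges):
    """Build a graph from (dependent, operation, upstream) triples."""
    graph = defaultdict(set)
    for dependent, operation, upstream in edges:
        graph[canonical(dependent)].add((operation, canonical(upstream)))
    return graph


def direct_dependencies(graph, artifact):
    """Artifacts one hop upstream of the given artifact."""
    return {up for _, up in graph.get(canonical(artifact), set())}


def indirect_dependencies(graph, artifact):
    """Artifacts reachable upstream, excluding the direct ones (assumed definition)."""
    start = canonical(artifact)
    direct = direct_dependencies(graph, start)
    seen = set()
    stack = list(direct)
    while stack:
        node = stack.pop()
        for _, up in graph.get(node, set()):
            if up not in seen and up != start:
                seen.add(up)
                stack.append(up)
    return seen - direct


# Hypothetical toy release: every name and operation label is illustrative.
edges = [
    ("toy-llm", "trained_on", "synth-dataset"),
    ("synth-dataset", "generated_by", "teacher-model"),
    ("toy-llm", "judged_by", "judge-v1"),
    ("Judge Model 1.0", "fine_tuned_from", "base-model"),
]
graph = build_graph(edges)
direct = direct_dependencies(graph, "toy-llm")
indirect = indirect_dependencies(graph, "toy-llm")
```

In this toy example the two spellings of the judge model collapse into one node, so `base-model` surfaces as an indirect dependency of `toy-llm` even though no single source document links them.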