What Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

Abstract

Modern LLM training pipelines increasingly rely on other models to generate data, filter corpora, judge outputs, and guide development decisions. These dependencies are recursive: a model may depend on an upstream artifact whose own dependencies are documented only in separate releases and artifacts. As a result, the full dependency structure is fragmented across heterogeneous public artifacts, with complexity and recursive depth far outpacing humans' ability to trace. We introduce ModSleuth, an agentic system that recursively reconstructs LLM dependency graphs from public artifacts with source-grounded evidence. We find that the primary challenge is no longer information extraction, but defining what constitutes a dependency and reconciling artifact references across inconsistent documentation. We address these challenges through a formalization that distinguishes direct and indirect dependencies, represents heterogeneous pipeline roles through operation-centered relationships, and resolves artifact identities across names, versions, and repositories. Applying ModSleuth to four public-artifact-rich LLM releases, we recover 1,060 source-verified dependencies and construct large-scale dependency graphs of modern LLM development. These graphs reveal multi-hop license obligations, train-evaluation coupling, discrepancies between released and training-time artifacts, and documentation inconsistencies that would otherwise be difficult to uncover. We release ModSleuth and the resulting dependency graphs to support transparent analysis of the increasingly complex ecosystems underlying modern LLMs.
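To make the formalization described above concrete, the following is a minimal sketch, not ModSleuth's actual implementation, of a dependency graph with operation-labeled edges, a distinction between direct and indirect dependencies, and a simple alias table for resolving artifact identities. Every artifact name, operation label, evidence string, and function name here (`student-1b`, `teacher-7b`, `train_on`, `canonical`, and so on) is a hypothetical placeholder chosen for illustration; the abstract does not specify the system's data model.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    """One edge: `target` was built using `source` through `operation`."""
    target: str
    source: str
    operation: str  # hypothetical role labels, e.g. "generate_data"
    evidence: str   # hypothetical pointer to the documenting public artifact


# Hypothetical alias table: inconsistent references -> one canonical id.
ALIASES = {
    "teacher-7b-instruct": "teacher-7b",
    "org/teacher-7b": "teacher-7b",
}


def canonical(name):
    """Resolve an artifact reference to its canonical identifier."""
    key = name.strip().lower()
    return ALIASES.get(key, key)


def build_graph(edges):
    """Map each canonical target to the edges that document its direct dependencies."""
    graph = {}
    for edge in edges:
        graph.setdefault(canonical(edge.target), []).append(edge)
    return graph


def direct_dependencies(graph, model):
    """Artifacts exactly one hop upstream of `model`."""
    return {canonical(e.source) for e in graph.get(canonical(model), [])}


def all_dependencies(graph, model):
    """Breadth-first traversal: every upstream artifact mapped to its hop count."""
    start = canonical(model)
    hops = {}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        for edge in graph.get(node, []):
            src = canonical(edge.source)
            # Skip already-seen nodes so cyclic documentation cannot loop forever.
            if src != start and src not in hops:
                hops[src] = depth + 1
                queue.append((src, depth + 1))
    return hops


def indirect_dependencies(graph, model):
    """Upstream artifacts reachable only through at least one intermediate artifact."""
    return set(all_dependencies(graph, model)) - direct_dependencies(graph, model)


# Hypothetical example: the same teacher model appears under two different names.
EDGES = [
    Dependency("student-1b", "sft-set", "train_on", "model card"),
    Dependency("sft-set", "Teacher-7B-Instruct", "generate_data", "dataset card"),
    Dependency("sft-set", "quality-classifier", "filter_corpus", "dataset card"),
    Dependency("org/teacher-7b", "web-corpus", "train_on", "technical report"),
]
GRAPH = build_graph(EDGES)
```

In this toy graph, `direct_dependencies(GRAPH, "student-1b")` contains only `sft-set`, while `indirect_dependencies` also surfaces `teacher-7b`, `quality-classifier`, and `web-corpus`. The last of these is reachable only because the two spellings of the teacher model resolve to the same node, which illustrates why identity resolution matters for multi-hop questions such as inherited license obligations.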