From Hugging Face to GitHub: Tracing License Drift in the Open-Source AI Ecosystem
September 11, 2025
Authors: James Jewitt, Hao Li, Bram Adams, Gopi Krishnan Rajbahadur, Ahmed E. Hassan
cs.AI
Abstract
Hidden license conflicts in the open-source AI ecosystem pose serious legal
and ethical risks, exposing organizations to potential litigation and users to
undisclosed risk. However, the field lacks a data-driven understanding of how
frequently these conflicts occur, where they originate, and which communities
are most affected. We present the first end-to-end audit of licenses for
datasets and models on Hugging Face, as well as their downstream integration
into open-source software applications, covering 364 thousand datasets, 1.6
million models, and 140 thousand GitHub projects. Our empirical analysis
reveals systemic non-compliance in which 35.5% of model-to-application
transitions eliminate restrictive license clauses by relicensing under
permissive terms. In addition, we prototype an extensible rule engine that
encodes nearly 200 SPDX and model-specific clauses for detecting license
conflicts and can resolve 86.4% of license conflicts in software applications.
To support future research, we release our dataset and the prototype engine.
Our study highlights license compliance as a critical governance challenge in
open-source AI and provides both the data and tools necessary to enable
automated, AI-aware compliance at scale.
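The core idea behind such a rule engine can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's actual engine: it maps each SPDX or model-specific license identifier to the restrictive clauses it carries, and flags "license drift" when a downstream relicense fails to preserve an upstream clause (the identifiers and clause names below are illustrative assumptions).

```python
# Minimal, hypothetical sketch of a license-conflict rule check.
# Each license identifier maps to the restrictive clauses it imposes;
# a conflict arises when a downstream license drops an upstream clause.

RESTRICTIVE_CLAUSES = {
    "CC-BY-NC-4.0": {"non-commercial", "attribution"},  # restrictive dataset license
    "GPL-3.0-only": {"copyleft", "source-disclosure"},
    "MIT": set(),          # permissive: carries no restrictive clauses
    "Apache-2.0": set(),
}

def find_conflicts(upstream: str, downstream: str) -> set[str]:
    """Return the restrictive clauses of `upstream` that `downstream` drops."""
    up = RESTRICTIVE_CLAUSES.get(upstream, set())
    down = RESTRICTIVE_CLAUSES.get(downstream, set())
    return up - down

# A model under CC-BY-NC-4.0 relicensed in an MIT application drops both clauses:
print(sorted(find_conflicts("CC-BY-NC-4.0", "MIT")))
```

A production engine would additionally need per-clause compatibility rules (some clauses are satisfied rather than violated by stricter downstream terms), which is where the nearly 200 encoded SPDX and model-specific clauses come in.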