From Hugging Face to GitHub: Tracing License Drift in the Open-Source AI Ecosystem

Abstract

Hidden license conflicts in the open-source AI ecosystem pose serious legal and ethical risks, exposing organizations to potential litigation and users to undisclosed risk. However, the field lacks a data-driven understanding of how frequently these conflicts occur, where they originate, and which communities are most affected. We present the first end-to-end audit of licenses for datasets and models on Hugging Face and of their downstream integration into open-source software applications, covering 364 thousand datasets, 1.6 million models, and 140 thousand GitHub projects. Our empirical analysis reveals systemic non-compliance: 35.5% of model-to-application transitions eliminate restrictive license clauses by relicensing under permissive terms. In addition, we prototype an extensible rule engine that encodes nearly 200 SPDX and model-specific clauses for detecting license conflicts, and that can resolve 86.4% of license conflicts in software applications. To support future research, we release our dataset and the prototype engine. Our study highlights license compliance as a critical governance challenge in open-source AI and provides both the data and the tools needed to enable automated, AI-aware compliance at scale.
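To make the notion of license drift concrete, the following is a minimal sketch of a clause-based check, not the engine described above. It assumes a small hand-written table mapping SPDX identifiers to simplified obligation labels; the table `CLAUSES`, the labels, and the functions `dropped_clauses` and `has_drift` are all hypothetical and greatly reduce what the real licenses say.

```python
from __future__ import annotations

# Hypothetical, heavily simplified clause table keyed by SPDX identifier.
# Illustrative only: not the rule base from the paper and not legal advice.
CLAUSES: dict[str, frozenset[str]] = {
    "MIT": frozenset({"attribution"}),
    "Apache-2.0": frozenset({"attribution", "state-changes"}),
    "CC-BY-NC-4.0": frozenset({"attribution", "non-commercial"}),
    "GPL-3.0-only": frozenset({"attribution", "share-alike", "disclose-source"}),
}


def dropped_clauses(upstream: str, downstream: str) -> frozenset[str]:
    """Return the upstream obligations that the downstream license does not carry."""
    return CLAUSES[upstream] - CLAUSES[downstream]


def has_drift(upstream: str, downstream: str) -> bool:
    """A transition drifts if at least one upstream obligation is dropped."""
    return bool(dropped_clauses(upstream, downstream))


if __name__ == "__main__":
    # A model under CC-BY-NC-4.0 integrated into an MIT-licensed application.
    print(sorted(dropped_clauses("CC-BY-NC-4.0", "MIT")))
    print(has_drift("MIT", "Apache-2.0"))
```

Under these assumptions, a transition counts as drift whenever the set difference is non-empty, which mirrors the case of a restrictive clause disappearing when an application is relicensed under permissive terms. A realistic engine would also need to handle license expressions, exceptions, and model-specific terms that this table omits.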

