Project Alexandria: Towards Freeing Scientific Knowledge from Copyright Burdens via LLMs
February 26, 2025
Authors: Christoph Schuhmann, Gollam Rabby, Ameya Prabhu, Tawsif Ahmed, Andreas Hochlehnert, Huu Nguyen, Nick Akinci Heidrich, Ludwig Schmidt, Robert Kaczmarczyk, Sören Auer, Jenia Jitsev, Matthias Bethge
cs.AI
Abstract
Paywalls, licenses and copyright rules often restrict the broad dissemination
and reuse of scientific knowledge. We take the position that it is both legally
and technically feasible to extract the scientific knowledge in scholarly
texts. Current methods, like text embeddings, fail to reliably preserve factual
content, and simple paraphrasing may not be legally sound. We urge the
community to adopt a new idea: convert scholarly documents into Knowledge Units
using LLMs. These units use structured data capturing entities, attributes and
relationships without stylistic content. We provide evidence that Knowledge
Units: (1) form a legally defensible framework for sharing knowledge from
copyrighted research texts, based on legal analyses of German copyright law and
U.S. Fair Use doctrine, and (2) preserve most (~95%) of the factual knowledge in
the original text, as measured by multiple-choice question (MCQ) performance on
facts from the original copyrighted text across four research domains. Freeing scientific knowledge
from copyright promises transformative benefits for scientific research and
education by allowing language models to reuse important facts from copyrighted
text. To support this, we share open-source tools for converting research
documents into Knowledge Units. Overall, our work posits the feasibility of
democratizing access to scientific knowledge while respecting copyright.
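
To make the notion of a Knowledge Unit more concrete, the following is a minimal sketch of one possible representation, assuming a unit is a set of entities, per-entity attributes, and subject-predicate-object relationships. The class names (`KnowledgeUnit`, `Relationship`), the field names, and the helper `mcq_accuracy` are hypothetical and are not taken from the released tools. The retention ratio at the end (MCQ accuracy with Knowledge Units divided by MCQ accuracy with the original text) is likewise an assumption about how such a figure could be computed, and all data in the example is made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    """A (subject, predicate, object) link between two entities."""
    subject: str
    predicate: str
    obj: str


@dataclass
class KnowledgeUnit:
    """Hypothetical container: facts only, no wording from the source text."""
    entities: list[str] = field(default_factory=list)
    attributes: dict[str, dict[str, str]] = field(default_factory=dict)
    relationships: list[Relationship] = field(default_factory=list)


def mcq_accuracy(predicted: list[str], answer_key: list[str]) -> float:
    """Fraction of multiple-choice questions answered correctly."""
    if len(predicted) != len(answer_key):
        raise ValueError("predicted and answer_key must have the same length")
    if not answer_key:
        raise ValueError("answer_key must not be empty")
    correct = sum(p == a for p, a in zip(predicted, answer_key))
    return correct / len(answer_key)


# Toy unit with illustrative content (not taken from the paper).
unit = KnowledgeUnit(
    entities=["aspirin", "cyclooxygenase"],
    attributes={"aspirin": {"class": "NSAID"}},
    relationships=[Relationship("aspirin", "inhibits", "cyclooxygenase")],
)

# Made-up answers: one run sees the original text, the other only the units.
key = ["A", "C", "B", "D"]
from_original = ["A", "C", "B", "D"]
from_units = ["A", "C", "B", "A"]

# Assumed definition: accuracy with units relative to accuracy with the original.
retention = mcq_accuracy(from_units, key) / mcq_accuracy(from_original, key)
print(f"relative retention: {retention:.2f}")
```

Keeping relationships as plain triples and attributes as key-value pairs leaves no room for the source's phrasing, which is the property the abstract emphasizes when it describes units "without stylistic content".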