Project Alexandria: Freeing Scientific Knowledge from Copyright Burdens via Large Language Models

Abstract

Paywalls, licenses, and copyright rules often restrict the broad dissemination and reuse of scientific knowledge. We take the position that it is both legally and technically feasible to extract the scientific knowledge in scholarly texts. Current methods, such as text embeddings, fail to reliably preserve factual content, and simple paraphrasing may not be legally sound. We urge the community to adopt a new idea: convert scholarly documents into Knowledge Units using LLMs. These units are structured data that capture entities, attributes, and relationships without stylistic content. We provide evidence that Knowledge Units (1) form a legally defensible framework for sharing knowledge from copyrighted research texts, based on legal analyses of German copyright law and the U.S. Fair Use doctrine, and (2) preserve most (~95%) of the factual knowledge in the original text, as measured by MCQ performance on facts from the original copyrighted text across four research domains. Freeing scientific knowledge from copyright promises transformative benefits for scientific research and education by allowing language models to reuse important facts from copyrighted text. To support this, we share open-source tools for converting research documents into Knowledge Units. Overall, our work suggests that it is feasible to democratize access to scientific knowledge while respecting copyright.
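To make the idea of a Knowledge Unit more concrete, the following is a minimal sketch of one possible representation, assuming only what the abstract states: a unit holds entities, attributes, and relationships and carries no stylistic content. The class names, field names, and the `mcq_accuracy` helper are hypothetical and are not taken from the released tools; the abstract does not specify the actual schema or how the ~95% figure is computed from MCQ results.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class Relationship:
    """A single subject-predicate-object statement (hypothetical schema)."""
    subject: str
    predicate: str
    obj: str


@dataclass
class KnowledgeUnit:
    """Structured facts from a passage, without the original wording.

    Field names are illustrative assumptions, not the authors' format.
    """
    entities: list[str]
    # Maps an entity name to its attribute name/value pairs.
    attributes: dict[str, dict[str, str]] = field(default_factory=dict)
    relationships: list[Relationship] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses, lists, and dicts.
        return json.dumps(asdict(self), ensure_ascii=False, sort_keys=True)


def mcq_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of multiple-choice questions answered correctly."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equal in length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)
```

Under this sketch, one way to estimate how much factual knowledge survives conversion would be to compare `mcq_accuracy` for a model answering from Knowledge Units against the same model answering from the original text. That comparison is an assumption about the evaluation, not a description of the authors' protocol.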