Project Alexandria: Towards Freeing Scientific Knowledge from Copyright Burdens via LLMs

Abstract

Paywalls, licenses, and copyright rules often restrict the broad dissemination and reuse of scientific knowledge. We take the position that it is both legally and technically feasible to extract the scientific knowledge contained in scholarly texts. Current methods, such as text embeddings, fail to reliably preserve factual content, and simple paraphrasing may not be legally sound. We urge the community to adopt a new idea: converting scholarly documents into Knowledge Units using LLMs. These units use structured data to capture entities, attributes, and relationships without stylistic content. We provide evidence that Knowledge Units (1) form a legally defensible framework for sharing knowledge from copyrighted research texts, based on legal analyses of German copyright law and the U.S. Fair Use doctrine, and (2) preserve most (~95%) of the factual knowledge in the original text, as measured by multiple-choice question (MCQ) performance on facts from the original copyrighted text across four research domains. Freeing scientific knowledge from copyright promises transformative benefits for scientific research and education by allowing language models to reuse important facts from copyrighted text. To support this, we share open-source tools for converting research documents into Knowledge Units. Overall, our work makes the case that democratizing access to scientific knowledge while respecting copyright is feasible.
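As an illustration only, the sketch below shows one way a Knowledge Unit could be represented as structured data holding entities, attributes, and relationships, together with a simple way to compare MCQ accuracy between the original text and its Knowledge Units. The class names, field names, and the `retention` ratio are assumptions made for this example and are not taken from the authors' released tools. The sample fact and the accuracy figures are likewise illustrative, not results from the paper.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class Relation:
    """A (subject, predicate, object) triple. Hypothetical schema."""
    subject: str
    predicate: str
    obj: str


@dataclass
class KnowledgeUnit:
    """Style-free container for facts. Field names are assumptions."""
    entities: list[str] = field(default_factory=list)
    # Maps an entity name to its attribute name/value pairs.
    attributes: dict[str, dict[str, str]] = field(default_factory=dict)
    relations: list[Relation] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses, lists, and dicts.
        return json.dumps(asdict(self), sort_keys=True)


def mcq_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of multiple-choice answers that match the answer key."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equal in length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def retention(acc_units: float, acc_original: float) -> float:
    """Assumed metric: accuracy with Knowledge Units as context, relative
    to accuracy with the original text as context."""
    if acc_original <= 0:
        raise ValueError("acc_original must be positive")
    return acc_units / acc_original


if __name__ == "__main__":
    unit = KnowledgeUnit(
        entities=["water"],
        attributes={"water": {"boiling_point_at_1_atm": "100 C"}},
        relations=[Relation("water", "composed_of", "hydrogen")],
    )
    print(unit.to_json())
    # Illustrative accuracies only, not figures reported by the authors.
    print(retention(acc_units=0.76, acc_original=0.80))
```

Keeping the representation to plain fields and triples reflects the idea that a unit should carry factual content without the wording of the source text. A real pipeline would additionally need an LLM extraction step, which is omitted here.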