

SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore

August 8, 2023
作者: Sewon Min, Suchin Gururangan, Eric Wallace, Hannaneh Hajishirzi, Noah A. Smith, Luke Zettlemoyer
cs.AI

Abstract

The legality of training language models (LMs) on copyrighted or otherwise restricted data is under intense debate. However, as we show, model performance significantly degrades if trained only on low-risk text (e.g., out-of-copyright books or government documents), due to its limited size and domain coverage. We present SILO, a new language model that manages this risk-performance tradeoff during inference. SILO is built by (1) training a parametric LM on Open License Corpus (OLC), a new corpus we curate with 228B tokens of public domain and permissively licensed text and (2) augmenting it with a more general and easily modifiable nonparametric datastore (e.g., containing copyrighted books or news) that is only queried during inference. The datastore allows use of high-risk data without training on it, supports sentence-level data attribution, and enables data producers to opt out from the model by removing content from the store. These capabilities can foster compliance with data-use regulations such as the fair use doctrine in the United States and the GDPR in the European Union. Our experiments show that the parametric LM struggles on domains not covered by OLC. However, access to the datastore greatly improves out of domain performance, closing 90% of the performance gap with an LM trained on the Pile, a more diverse corpus with mostly high-risk text. We also analyze which nonparametric approach works best, where the remaining errors lie, and how performance scales with datastore size. Our results suggest that it is possible to build high quality language models while mitigating their legal risk.
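The abstract's core mechanism, a parametric LM whose next-token distribution is interpolated with a nonparametric datastore queried only at inference, can be illustrated with a small kNN-LM-style sketch. Everything below (the toy vocabulary, vector keys, `Datastore` class, and source labels) is illustrative, not the paper's actual implementation; it only shows how a removable datastore enables attribution and opt-out without retraining.

```python
import numpy as np

# Toy vocabulary; a real system would use the LM's tokenizer.
VOCAB = ["the", "court", "ruled", "news", "story"]

class Datastore:
    """Maps context vectors to next-token ids. Each entry records its
    source document, so data producers can opt out by removal."""
    def __init__(self):
        self.keys, self.values, self.sources = [], [], []

    def add(self, key, token_id, source):
        self.keys.append(np.asarray(key, dtype=float))
        self.values.append(token_id)
        self.sources.append(source)  # enables attribution / opt-out

    def remove_source(self, source):
        # Opt-out: delete every entry contributed by this source.
        keep = [i for i, s in enumerate(self.sources) if s != source]
        self.keys = [self.keys[i] for i in keep]
        self.values = [self.values[i] for i in keep]
        self.sources = [self.sources[i] for i in keep]

    def knn_distribution(self, query, k=2):
        """Distribution over VOCAB from the k nearest stored contexts."""
        probs = np.zeros(len(VOCAB))
        if not self.keys:
            return probs  # empty store contributes nothing
        dists = np.array([np.linalg.norm(query - key) for key in self.keys])
        nearest = np.argsort(dists)[:k]
        weights = np.exp(-dists[nearest])  # closer neighbors weigh more
        weights /= weights.sum()
        for i, w in zip(nearest, weights):
            probs[self.values[i]] += w
        return probs

def interpolate(p_lm, p_knn, lam=0.5):
    """Final next-token distribution: mix the low-risk parametric LM
    with the high-risk (but removable) datastore."""
    return (1 - lam) * p_lm + lam * p_knn
```

In this toy setup, calling `remove_source("news_corpus")` and re-querying immediately changes the model's predictions, which is the opt-out property the abstract describes: high-risk text influences inference through the store, never through the trained parameters.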