SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore
August 8, 2023
Authors: Sewon Min, Suchin Gururangan, Eric Wallace, Hannaneh Hajishirzi, Noah A. Smith, Luke Zettlemoyer
cs.AI
Abstract
The legality of training language models (LMs) on copyrighted or otherwise
restricted data is under intense debate. However, as we show, model performance
significantly degrades when models are trained only on low-risk text (e.g.,
out-of-copyright books or government documents), because such text is limited in
size and domain coverage. We
present SILO, a new language model that manages this risk-performance tradeoff
during inference. SILO is built by (1) training a parametric LM on Open License
Corpus (OLC), a new corpus we curate with 228B tokens of public domain and
permissively licensed text, and (2) augmenting it with a more general and easily
modifiable nonparametric datastore (e.g., containing copyrighted books or news)
that is only queried during inference. The datastore allows use of high-risk
data without training on it, supports sentence-level data attribution, and
enables data producers to opt out of the model by removing content from the
store. These capabilities can foster compliance with data-use regulations such
as the fair use doctrine in the United States and the GDPR in the European
Union. Our experiments show that the parametric LM struggles on domains not
covered by OLC. However, access to the datastore greatly improves out-of-domain
performance, closing 90% of the performance gap with an LM trained on the Pile,
a more diverse corpus with mostly high-risk text. We also analyze which
nonparametric approach works best, where the remaining errors lie, and how
performance scales with datastore size. Our results suggest that it is possible
to build high-quality language models while mitigating their legal risk.
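
To make the inference-time mechanics concrete, below is a minimal sketch of a kNN-LM-style query, the kind of nonparametric approach the abstract alludes to. The function name, datastore layout, and interpolation weight lam are illustrative assumptions, not the SILO implementation.

```python
# Minimal kNN-LM-style sketch (illustrative, not the SILO codebase).
# The datastore holds (context vector, next-token id) pairs built from
# retrievable text; deleting rows is how a data producer opts out.
import numpy as np

def knn_lm_next_token_probs(p_lm, query_vec, keys, values, vocab_size,
                            k=8, temperature=1.0, lam=0.25):
    """Interpolate the parametric LM's next-token distribution with a
    distribution induced by the k nearest datastore entries.

    p_lm:      parametric next-token probabilities, shape (vocab_size,)
    query_vec: hidden state for the current context, shape (d,)
    keys:      datastore context vectors, shape (n, d)
    values:    next-token ids for each key, shape (n,)
    lam:       weight on the nonparametric distribution (assumed value)
    """
    # Score datastore entries by negative squared L2 distance to the query.
    scores_all = -np.sum((keys - query_vec) ** 2, axis=1)
    top = np.argsort(scores_all)[-k:]  # indices of the k nearest neighbors

    # Softmax over neighbor scores (shifted for numerical stability).
    w = np.exp((scores_all[top] - scores_all[top].max()) / temperature)
    w /= w.sum()

    # Aggregate neighbor weights onto their recorded next tokens. The
    # retrieved neighbors double as evidence for data attribution.
    p_knn = np.zeros(vocab_size)
    for weight, tok in zip(w, values[top]):
        p_knn[tok] += weight

    return lam * p_knn + (1 - lam) * p_lm
```

Under this sketch, removing a document from the model reduces to deleting its rows from keys and values, with no retraining of the parametric LM; that is what makes opt-out and attribution cheap relative to baking the same text into the parameters.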