SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore

Abstract

The legality of training language models (LMs) on copyrighted or otherwise restricted data is under intense debate. However, as we show, model performance degrades significantly when training is restricted to low-risk text (e.g., out-of-copyright books or government documents), because such text is limited in size and domain coverage. We present SILO, a new language model that manages this risk-performance tradeoff during inference. SILO is built by (1) training a parametric LM on the Open License Corpus (OLC), a new corpus we curate with 228B tokens of public domain and permissively licensed text, and (2) augmenting it with a more general and easily modifiable nonparametric datastore (e.g., containing copyrighted books or news) that is only queried during inference. The datastore allows use of high-risk data without training on it, supports sentence-level data attribution, and enables data producers to opt out of the model by removing their content from the store. These capabilities can foster compliance with data-use regulations such as the fair use doctrine in the United States and the GDPR in the European Union. Our experiments show that the parametric LM struggles on domains not covered by OLC. However, access to the datastore greatly improves out-of-domain performance, closing 90% of the performance gap with an LM trained on the Pile, a more diverse corpus with mostly high-risk text. We also analyze which nonparametric approach works best, where the remaining errors lie, and how performance scales with datastore size. Our results suggest that it is possible to build high-quality language models while mitigating their legal risk.
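
To make the datastore idea concrete, here is a minimal sketch of one way a parametric LM can be combined with an inference-time datastore. The abstract does not say which nonparametric method SILO uses, so this assumes a kNN-LM-style interpolation between the model's next-token distribution and a distribution built from retrieved neighbors. Every name in it (Datastore, remove_source, interpolate, lam, the toy vectors and source ids) is hypothetical and chosen for illustration only.

```python
import math
from collections import defaultdict


class Datastore:
    """Toy nonparametric datastore of (context vector, next token, source id) entries."""

    def __init__(self):
        self.entries = []  # list of (key, next_token, source)

    def add(self, key, next_token, source):
        self.entries.append((tuple(key), next_token, source))

    def remove_source(self, source):
        """Opt-out: drop every entry contributed by `source`; no retraining involved."""
        self.entries = [e for e in self.entries if e[2] != source]

    def nearest(self, query, k):
        """Return the k closest entries as (squared distance, token, source)."""
        scored = [
            (sum((q - x) ** 2 for q, x in zip(query, key)), token, source)
            for key, token, source in self.entries
        ]
        return sorted(scored)[:k]


def knn_distribution(neighbors, temperature=1.0):
    """Softmax over negative distances, aggregated per token."""
    if not neighbors:
        return {}
    weights = [math.exp(-d / temperature) for d, _, _ in neighbors]
    total = sum(weights)
    probs = defaultdict(float)
    for w, (_, token, _) in zip(weights, neighbors):
        probs[token] += w / total
    return dict(probs)


def interpolate(p_lm, p_knn, lam):
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_lm."""
    if not p_knn:  # empty datastore: fall back to the parametric model alone
        return dict(p_lm)
    vocab = set(p_lm) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_lm.get(t, 0.0) for t in vocab}


# Toy usage: the "parametric LM" is a fixed distribution, not a real model.
store = Datastore()
store.add([1.0, 0.0], "wizard", source="novel_A")
store.add([0.9, 0.1], "wizard", source="novel_A")
store.add([0.0, 1.0], "court", source="news_B")

p_lm = {"wizard": 0.1, "court": 0.9}
query = [1.0, 0.0]

neighbors = store.nearest(query, k=2)
mixed = interpolate(p_lm, knn_distribution(neighbors), lam=0.5)
sources = sorted({s for _, _, s in neighbors})  # attribution: who contributed

store.remove_source("novel_A")  # the data producer opts out
after_opt_out = interpolate(
    p_lm, knn_distribution(store.nearest(query, k=2)), lam=0.5
)
```

In this toy setup the retrieved neighbors carry their source ids, which is the hook for attribution, and removing a source changes the next prediction immediately because nothing about that source was ever stored in model weights. A real system would use learned context representations and an approximate nearest-neighbor index rather than the brute-force search shown here.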
