On the Reliability of Watermarks for Large Language Models

Abstract

Large language models (LLMs) are now deployed for everyday use and are positioned to produce large quantities of text in the coming decade. Machine-generated text may displace human-written text on the internet and could be used for malicious purposes, such as spearphishing attacks and social media bots. Watermarking is a simple and effective strategy for mitigating such harms by enabling the detection and documentation of LLM-generated text. Yet a crucial question remains: how reliable is watermarking in realistic settings in the wild? There, watermarked text might be mixed with other text sources, paraphrased by human writers or other language models, and used for applications across a broad range of domains, both social and technical. In this paper, we explore different detection schemes, quantify their power at detecting watermarks, and determine how much machine-generated text needs to be observed in each scenario to reliably detect the watermark. We especially highlight our human study, in which we investigate the reliability of watermarking when faced with human paraphrasing. We compare watermark-based detection to other detection strategies and find that, overall, watermarking is a reliable solution, especially because of its sample complexity: for all attacks we consider, the watermark evidence compounds as more examples are given, and the watermark is eventually detected.
