On the Reliability of Watermarks for Large Language Models

Abstract

Large language models (LLMs) are now deployed in everyday use and are positioned to produce large quantities of text in the coming decade. Machine-generated text may displace human-written text on the internet and has the potential to be used for malicious purposes, such as spearphishing attacks and social media bots. Watermarking is a simple and effective strategy for mitigating such harms, because it enables the detection and documentation of LLM-generated text. Yet a crucial question remains: how reliable is watermarking in realistic settings in the wild? There, watermarked text might be mixed with other text sources, paraphrased by human writers or other language models, and used for applications in a broad range of domains, both social and technical. In this paper, we explore different detection schemes, quantify their power at detecting watermarks, and determine how much machine-generated text needs to be observed in each scenario to reliably detect the watermark. We especially highlight our human study, in which we investigate the reliability of watermarking when faced with human paraphrasing. We compare watermark-based detection to other detection strategies and find overall that watermarking is a reliable solution, especially because of its sample complexity: for all attacks we consider, the watermark evidence compounds as more examples are given, and the watermark is eventually detected.
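The sample-complexity claim can be made concrete with a small sketch. It assumes a green-list style watermark scored with a one-proportion z-test, which the abstract itself does not spell out. The function names, the green-list fraction `gamma`, and the threshold `z_threshold` are illustrative assumptions, not values taken from the paper.

```python
import math


def watermark_z_score(green_count: int, total_tokens: int, gamma: float = 0.25) -> float:
    """z-score of the observed green-token count against the null hypothesis
    that the text is unwatermarked (each token is green with probability gamma).
    gamma = 0.25 is an assumed, illustrative setting."""
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std


def tokens_needed(green_rate: float, gamma: float = 0.25, z_threshold: float = 4.0) -> int:
    """Smallest token count T at which a text whose green fraction is
    green_rate is expected to reach z_threshold.

    Follows from z = (green_rate - gamma) * sqrt(T) / sqrt(gamma * (1 - gamma)).
    """
    if green_rate <= gamma:
        raise ValueError("green_rate must exceed gamma for the evidence to accumulate")
    return math.ceil(z_threshold ** 2 * gamma * (1.0 - gamma) / (green_rate - gamma) ** 2)


if __name__ == "__main__":
    # Same green fraction (50%), more tokens: the z-score keeps growing.
    for t in (50, 100, 200, 400):
        print(t, round(watermark_z_score(t // 2, t), 2))

    # A weaker residual signal (e.g. after paraphrasing) needs more tokens.
    for rate in (0.5, 0.4, 0.3):
        print(rate, tokens_needed(rate))
```

Under these assumptions the z-score grows with the square root of the number of observed tokens whenever the green fraction stays above `gamma`. An attack that dilutes the signal without erasing it therefore raises the amount of text needed for detection rather than preventing detection.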
