On the Reliability of Watermarks for Large Language Models

Abstract

Large language models (LLMs) are now deployed for everyday use and are positioned to produce large quantities of text in the coming decade. Machine-generated text may displace human-written text on the internet and could be used for malicious purposes such as spearphishing attacks and social media bots. Watermarking is a simple and effective strategy for mitigating such harms, because it enables LLM-generated text to be detected and documented. Yet a crucial question remains: how reliable is watermarking in realistic settings in the wild? There, watermarked text may be mixed with other text sources, paraphrased by human writers or other language models, and used for applications across a broad range of domains, both social and technical. In this paper, we explore different detection schemes, quantify their power at detecting watermarks, and determine how much machine-generated text must be observed in each scenario to reliably detect the watermark. We especially highlight our human study, in which we investigate the reliability of watermarking when faced with human paraphrasing. We compare watermark-based detection to other detection strategies and find overall that watermarking is a reliable solution, especially because of its sample complexity: for all attacks we consider, the watermark evidence compounds as more examples are given, and the watermark is eventually detected.
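The sample-complexity claim can be made concrete with a small sketch. The abstract does not name a detection statistic, so the following assumes, purely for illustration, a "green-list" style watermark scored with a one-proportion z-test: a fraction `gamma` of the vocabulary is expected to be hit by chance, and the detector counts how many observed tokens fall in it. The function names, `gamma = 0.25`, and the threshold `z_threshold = 4.0` are hypothetical choices, not values taken from the paper.

```python
import math


def watermark_z_score(green_count: int, total_tokens: int, gamma: float = 0.25) -> float:
    """One-proportion z-score for an assumed green-list watermark.

    Under the null hypothesis (no watermark), each token is "green" with
    probability gamma, so the green count has mean gamma * T and
    variance T * gamma * (1 - gamma).
    """
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std


def tokens_needed(green_fraction: float, gamma: float = 0.25, z_threshold: float = 4.0) -> int:
    """Smallest token count T at which z reaches z_threshold.

    For a fixed observed green fraction p, z = (p - gamma) * sqrt(T) / sqrt(gamma * (1 - gamma)),
    so solving for T gives T = z^2 * gamma * (1 - gamma) / (p - gamma)^2.
    """
    if green_fraction <= gamma:
        raise ValueError("green_fraction must exceed gamma for detection to be possible")
    t = (z_threshold ** 2) * gamma * (1.0 - gamma) / (green_fraction - gamma) ** 2
    return math.ceil(t)


if __name__ == "__main__":
    # Same green fraction (0.5), four times as many tokens: the z-score doubles.
    print(watermark_z_score(24, 48))    # 4.0
    print(watermark_z_score(96, 192))   # 8.0
    # A weaker signal (e.g. after paraphrasing) needs more tokens, but is still detectable.
    print(tokens_needed(0.5), tokens_needed(0.3))
```

Under these assumptions the z-score grows with the square root of the number of tokens observed, so an attack that dilutes the green fraction without pushing it down to `gamma` raises the amount of text needed rather than making detection impossible.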
