Watermarking Makes Language Models Radioactive

February 22, 2024
Authors: Tom Sander, Pierre Fernandez, Alain Durmus, Matthijs Douze, Teddy Furon
cs.AI

Abstract

This paper investigates the radioactivity of LLM-generated texts, i.e. whether it is possible to detect that such output was used as training data. Conventional methods like membership inference can carry out this detection with some level of accuracy. We show that watermarked training data leaves traces that are easier to detect and much more reliable than membership inference. We link the contamination level to the watermark robustness, its proportion in the training set, and the fine-tuning process. We notably demonstrate that training on watermarked synthetic instructions can be detected with high confidence (p-value < 1e-5) even when as little as 5% of training text is watermarked. Thus, LLM watermarking, originally designed for detecting machine-generated text, makes it possible to identify whether the outputs of a watermarked LLM were used to fine-tune another LLM.
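To make the detection idea concrete, below is a minimal Python sketch of a radioactivity test under a greenlist/redlist watermark (in the spirit of Kirchenbauer et al.): text generated by a suspect model is scored with the watermark detector, and the excess of "green" tokens is turned into a p-value. This is not the authors' implementation; the hashing scheme, the `GAMMA` value, and names such as `radioactivity_pvalue` and `suspect_model.generate` are illustrative assumptions.

```python
# Hypothetical sketch: score a suspect model's generations with a greenlist
# watermark detector and compute a one-sided p-value for watermark presence.
import hashlib
from math import sqrt, erfc

GAMMA = 0.25  # assumed fraction of the vocabulary placed in the green list


def in_greenlist(prev_token: int, token: int) -> bool:
    """Pseudo-randomly decide whether `token` is green, seeded by the previous token."""
    h = hashlib.sha256(f"{prev_token}-{token}".encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64 < GAMMA


def radioactivity_pvalue(token_ids: list[int]) -> float:
    """One-sided p-value that the text contains more green tokens than chance."""
    pairs = list(zip(token_ids, token_ids[1:]))
    n = len(pairs)
    if n == 0:
        return 1.0
    greens = sum(in_greenlist(p, t) for p, t in pairs)
    # Under H0 (no watermark leaked into the model), greens ~ Binomial(n, GAMMA).
    z = (greens - GAMMA * n) / sqrt(GAMMA * (1 - GAMMA) * n)
    return 0.5 * erfc(z / sqrt(2))  # normal approximation of the upper tail


# Usage (hypothetical calls): generate text with the suspect model, with no
# watermark applied at generation time, and test whether the teacher's
# watermark leaks through.
# suspect_text_ids = suspect_model.generate(prompts)
# p = radioactivity_pvalue(suspect_text_ids)
# A p-value below, e.g., 1e-5 suggests the suspect model was fine-tuned on
# watermarked outputs of the watermarked LLM.
```

The design choice mirrors the paper's claim: because the watermark biases token statistics, a model fine-tuned on watermarked text partially reproduces that bias, so a standard watermark detector applied to the suspect model's own outputs yields a statistically grounded p-value rather than the weaker signal of membership inference.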