
Watermarking Makes Language Models Radioactive

February 22, 2024
Authors: Tom Sander, Pierre Fernandez, Alain Durmus, Matthijs Douze, Teddy Furon
cs.AI

Abstract

This paper investigates the radioactivity of LLM-generated texts, i.e., whether it is possible to detect that such output was used as training data. Conventional methods like membership inference can carry out this detection with some level of accuracy. We show that watermarked training data leaves traces that are easier to detect and much more reliable than membership inference. We link the contamination level to the watermark robustness, its proportion in the training set, and the fine-tuning process. We notably demonstrate that training on watermarked synthetic instructions can be detected with high confidence (p-value < 1e-5) even when as little as 5% of the training text is watermarked. Thus, LLM watermarking, originally designed for detecting machine-generated text, makes it easy to identify whether the outputs of a watermarked LLM were used to fine-tune another LLM.
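The detection described in the abstract is a statistical test: if a suspect model was fine-tuned on watermarked text, tokens favored by the watermark should be over-represented in its generations, which yields a p-value against the null hypothesis of no contamination. Below is a minimal sketch of such a test, assuming a green-list watermark in the style of Kirchenbauer et al. (2023); the constants (GAMMA, WATERMARK_KEY), the hashing scheme, and the function names are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of watermark "radioactivity" detection, assuming a
# green-list watermark (Kirchenbauer et al., 2023). All constants and
# the hash-based green list are illustrative assumptions.
import hashlib
import random
from math import comb

GAMMA = 0.25          # assumed fraction of the vocabulary marked "green"
VOCAB_SIZE = 32000    # assumed vocabulary size
WATERMARK_KEY = 42    # assumed secret key held by the watermark owner

def is_green(prev_token: int, token: int) -> bool:
    """Pseudo-randomly decide whether `token` falls in the green list
    seeded by the previous token and the secret key."""
    digest = hashlib.sha256(f"{WATERMARK_KEY}:{prev_token}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GAMMA

def binomial_pvalue(green_count: int, total: int, gamma: float = GAMMA) -> float:
    """One-sided p-value: probability of observing at least `green_count`
    green tokens out of `total` if the text were NOT watermarked."""
    return sum(comb(total, k) * gamma**k * (1 - gamma)**(total - k)
               for k in range(green_count, total + 1))

def radioactivity_score(token_ids: list[int]) -> float:
    """Score text generated by a *suspect* model: if it was fine-tuned on
    watermarked data, green tokens should appear more often than GAMMA."""
    greens = sum(is_green(prev, tok) for prev, tok in zip(token_ids, token_ids[1:]))
    return binomial_pvalue(greens, len(token_ids) - 1)

if __name__ == "__main__":
    # Toy example on random token ids; a real test would score many
    # generations from the suspect model and aggregate the evidence.
    random.seed(0)
    tokens = [random.randrange(VOCAB_SIZE) for _ in range(200)]
    print(f"p-value on unwatermarked tokens: {radioactivity_score(tokens):.3f}")
```

In this toy setup the p-value stays large for unwatermarked text and shrinks as the proportion of green tokens rises, which is the intuition behind detecting contamination with high confidence even when only a small fraction of the fine-tuning data carried the watermark.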