Robust Distortion-free Watermarks for Language Models
July 28, 2023
Authors: Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, Percy Liang
cs.AI
Abstract
We propose a methodology for planting watermarks in text from an
autoregressive language model that are robust to perturbations without changing
the distribution over text up to a certain maximum generation budget. We
generate watermarked text by mapping a sequence of random numbers -- which we
compute using a randomized watermark key -- to a sample from the language
model. To detect watermarked text, any party who knows the key can align the
text to the random number sequence. We instantiate our watermark methodology
with two sampling schemes: inverse transform sampling and exponential minimum
sampling. We apply these watermarks to three language models -- OPT-1.3B,
LLaMA-7B and Alpaca-7B -- to experimentally validate their statistical power
and robustness to various paraphrasing attacks. Notably, for both the OPT-1.3B
and LLaMA-7B models, we find we can reliably detect watermarked text (p ≤ 0.01)
from 35 tokens even after corrupting 40–50% of the tokens via random edits
(i.e., substitutions, insertions, or deletions). For the
Alpaca-7B model, we conduct a case study on the feasibility of watermarking
responses to typical user instructions. Due to the lower entropy of the
responses, detection is more difficult: around 25% of the responses -- whose
median length is around 100 tokens -- are detectable with p ≤ 0.01, and
the watermark is also less robust to certain automated paraphrasing attacks we
implement.
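The two sampling schemes named above can be illustrated with a minimal sketch. Given the model's next-token probabilities and key-derived uniform random numbers, inverse transform sampling inverts the CDF at a single uniform value, while exponential minimum sampling picks the token minimizing -log(u_i)/p_i; in both cases the token is distributed exactly according to the model, which is what makes the watermark distortion-free. The function names and NumPy dependency here are our own illustration, not the authors' implementation:

```python
import numpy as np

def inverse_transform_sample(probs, u):
    """Inverse transform sampling: invert the CDF of `probs` at u in [0, 1).

    Marginally over a uniform u, token i is returned with probability
    probs[i], so the text distribution is unchanged.
    """
    return int(np.searchsorted(np.cumsum(probs), u, side="right"))

def exponential_minimum_sample(probs, u):
    """Exponential minimum sampling (a Gumbel-trick variant).

    `u` is a vector of independent uniforms, one per vocabulary entry,
    derived from the watermark key. The token minimizing -log(u_i) / p_i
    wins with probability p_i, so sampling stays distortion-free while
    the choice is deterministically tied to the key's random numbers.
    """
    return int(np.argmin(-np.log(u) / probs))
```

Because the chosen token is a deterministic function of the key-derived randomness, a party holding the key can later align observed text against the random number sequence to test for the watermark.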