Robust Distortion-free Watermarks for Language Models
July 28, 2023
Authors: Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, Percy Liang
cs.AI
Abstract
We propose a methodology for planting watermarks in text from an
autoregressive language model that are robust to perturbations without changing
the distribution over text up to a certain maximum generation budget. We
generate watermarked text by mapping a sequence of random numbers -- which we
compute using a randomized watermark key -- to a sample from the language
model. To detect watermarked text, any party who knows the key can align the
text to the random number sequence. We instantiate our watermark methodology
with two sampling schemes: inverse transform sampling and exponential minimum
sampling. We apply these watermarks to three language models -- OPT-1.3B,
LLaMA-7B and Alpaca-7B -- to experimentally validate their statistical power
and robustness to various paraphrasing attacks. Notably, for both the OPT-1.3B
and LLaMA-7B models, we find we can reliably detect watermarked text (p ≤
0.01) from 35 tokens even after corrupting between 40% and 50% of the tokens
via random edits (i.e., substitutions, insertions, or deletions). For the
Alpaca-7B model, we conduct a case study on the feasibility of watermarking
responses to typical user instructions. Due to the lower entropy of the
responses, detection is more difficult: around 25% of the responses -- whose
median length is around 100 tokens -- are detectable with p ≤ 0.01, and
the watermark is also less robust to certain automated paraphrasing attacks we
implement.
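The two sampling schemes named above can be sketched as follows. This is a minimal, illustrative implementation, not the paper's code: it assumes each generation step receives key-derived uniform random numbers, and the function names (`exp_min_sample`, `inverse_transform_sample`) are our own. Both rules are distortion-free for a single draw: token i is selected with exactly probability p_i.

```python
import numpy as np

def exp_min_sample(probs, u):
    """Exponential-minimum sampling: pick argmax_i u_i^(1/p_i).

    Equivalent to argmin_i -log(u_i)/p_i; since -log(u_i)/p_i ~ Exp(p_i),
    token i wins with probability p_i, leaving the model's per-step
    distribution unchanged. `u` is one key-derived uniform per vocab entry.
    """
    with np.errstate(divide="ignore"):
        scores = np.log(u) / probs  # maximize log(u_i)/p_i (values <= 0)
    return int(np.argmax(scores))

def inverse_transform_sample(probs, u, perm):
    """Inverse transform sampling: order the vocabulary by a key-derived
    permutation `perm`, then invert the resulting CDF at a single
    key-derived uniform u in [0, 1)."""
    cdf = np.cumsum(probs[perm])
    idx = int(np.searchsorted(cdf, u))
    return int(perm[min(idx, len(perm) - 1)])
```

In the watermarking setting, the uniforms are not freshly drawn but computed from the shared key, so a party holding the key can later reproduce them for detection.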
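Detection works by testing whether the observed text correlates with the key's random number sequence. The paper's detector additionally aligns the text to the sequence for robustness to edits; the sketch below omits alignment and uses a simplified score for the exponential-minimum scheme, with hypothetical helper names, to show the shape of the test.

```python
import numpy as np

def watermark_score(tokens, key_uniforms):
    """Sum of -log(1 - u_t) over positions t, where u_t is the key's uniform
    for the observed token. Large when the text follows the watermark."""
    u = key_uniforms[np.arange(len(tokens)), tokens]
    return float(np.sum(-np.log(1.0 - u)))

def empirical_p_value(tokens, key_uniforms, vocab_size, n_null=200, seed=0):
    """Empirical p-value: compare the observed score against scores computed
    under freshly drawn (wrong) keys. Small values indicate watermarked text."""
    rng = np.random.default_rng(seed)
    observed = watermark_score(tokens, key_uniforms)
    exceed = sum(
        watermark_score(tokens, rng.random((len(tokens), vocab_size))) >= observed
        for _ in range(n_null)
    )
    return (1 + exceed) / (1 + n_null)
```

Under the null (text independent of the key), each -log(1 - u_t) is a standard exponential, so the score concentrates around the text length; watermarked text scores well above that, driving the p-value toward 1/(n_null + 1).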