Linear Ensembling Eliminates Watermarks: On the Fragility of Distribution Perturbations in Large Language Models

Abstract

Watermarking embeds statistical signatures in AI-generated text to enable detection and attribution. We reveal a fundamental vulnerability: when users can access multiple models at once, as is common today, watermarks are easily defeated. A watermark perturbs a model's output distribution away from the original, and in a competitive market these perturbations are typically independent across providers. We prove that averaging the output probability distributions recovers the unwatermarked distribution up to a second-order error term. Empirically, averaging over as few as 3-5 models is enough to cancel the perturbations. We introduce WASH (Watermark Attenuation via Statistical Hybridisation), which addresses the practical challenges of ensemble generation across heterogeneous models: vocabulary misalignment and tokenisation differences. Experiments across six watermarking schemes and three LLMs show that averaging over 3 models suppresses detection z-scores from 5-300 to below 2 (under the detection threshold of 4) and reduces the true positive rate (TPR) at a 5% false positive rate (FPR) to below 50%, while improving quality by 27.5% and running 6 times faster than the best baseline on long-sequence generation. Our results suggest that robust AI-text detection via watermarking requires either accepting this fundamental vulnerability or unprecedented coordination among model providers.
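
The cancellation effect described above can be illustrated with a toy simulation. This is a minimal sketch under strong simplifying assumptions, not the WASH method itself: every provider is assumed to share the same unwatermarked next-token distribution p, and each applies an independent green-list logit bias (a KGW-style perturbation chosen here for illustration; the abstract does not say which six schemes were tested). The function name `simulate` and the parameters `delta`, `gamma`, `vocab_size` and `seed` are hypothetical. Vocabulary alignment and tokenisation differences are ignored entirely.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()


def simulate(num_models=3, vocab_size=1000, delta=2.0, gamma=0.5, seed=0):
    """Compare one watermarked distribution with the mean of several.

    Assumptions (illustrative only): all providers share the same base
    distribution p, and provider k adds `delta` to the logits of its own
    random green list, drawn independently with density `gamma`.
    """
    rng = np.random.default_rng(seed)
    base_logits = rng.normal(size=vocab_size)
    p = softmax(base_logits)  # unwatermarked distribution

    # One independent boolean green-list mask per provider.
    green = rng.random((num_models, vocab_size)) < gamma

    # Watermarked distributions q_k, then their plain arithmetic mean.
    q = np.stack([softmax(base_logits + delta * g) for g in green])
    mixture = q.mean(axis=0)

    # Provider 0's detector looks for excess probability mass on its green list.
    g0 = green[0]
    base_green_mass = p[g0].sum()
    return {
        "green_excess_single": float(q[0][g0].sum() - base_green_mass),
        "green_excess_mixture": float(mixture[g0].sum() - base_green_mass),
        "tv_single": 0.5 * float(np.abs(q[0] - p).sum()),
        "tv_mixture": 0.5 * float(np.abs(mixture - p).sum()),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.4f}")
```

In this toy setting, the excess green-list mass seen by a single provider's detector shrinks once its distribution is averaged with independently watermarked ones, and the mixture lies closer to p in total variation distance than any single watermarked distribution does. Real deployments differ in that the base models are not identical, which is part of what WASH is designed to handle.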