LLM Watermark Evasion via Bias Inversion
September 27, 2025
Authors: Jeongyeon Hwang, Sangdon Park, Jungseul Ok
cs.AI
Abstract
Watermarking for large language models (LLMs) embeds a statistical signal during generation to enable detection of model-produced text. While watermarking has proven effective in benign settings, its robustness under adversarial evasion remains contested. To advance a rigorous understanding and evaluation of such vulnerabilities, we propose the Bias-Inversion Rewriting Attack (BIRA), which is theoretically motivated and model-agnostic. BIRA weakens the watermark signal by suppressing the logits of likely watermarked tokens during LLM-based rewriting, without any knowledge of the underlying watermarking scheme. Across recent watermarking methods, BIRA achieves over 99% evasion while preserving the semantic content of the original text. Beyond demonstrating an attack, our results reveal a systematic vulnerability, emphasizing the need for stress testing and robust defenses.
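To make the logit-suppression idea concrete, the sketch below rewrites a suspected-watermarked passage with an off-the-shelf causal LM while subtracting a fixed penalty from the logits of tokens drawn from that passage. This is a minimal illustration under stated assumptions, not the paper's method: the rewriter (gpt2), the penalty value, and the heuristic of treating the input's own tokens as "likely watermarked" are placeholders for BIRA's actual bias estimation, which the abstract does not specify.

```python
# Minimal sketch of bias-inversion-style rewriting (illustrative only; the
# token-suspicion heuristic below is an assumption, not BIRA's scoring rule).
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    LogitsProcessor,
    LogitsProcessorList,
)


class SuppressLikelyWatermarked(LogitsProcessor):
    """Subtract a fixed penalty from the logits of suspected token ids,
    inverting the positive bias a green-list watermark would have added."""

    def __init__(self, token_ids, penalty=2.0):
        self.token_ids = torch.tensor(sorted(set(token_ids)), dtype=torch.long)
        self.penalty = penalty

    def __call__(self, input_ids, scores):
        # scores: (batch, vocab) logits for the next token at each step.
        scores[:, self.token_ids] = scores[:, self.token_ids] - self.penalty
        return scores


tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any causal LM rewriter
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Placeholder heuristic: treat every token of the suspected-watermarked
# input as "likely watermarked" and penalize it during rewriting.
watermarked_text = "The quick brown fox jumps over the lazy dog."
suspect_ids = tokenizer(watermarked_text, add_special_tokens=False).input_ids

prompt = f"Paraphrase the following text.\nText: {watermarked_text}\nParaphrase:"
inputs = tokenizer(prompt, return_tensors="pt")

out = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,  # gpt2 defines no pad token
    logits_processor=LogitsProcessorList(
        [SuppressLikelyWatermarked(suspect_ids, penalty=2.0)]
    ),
)
# Decode only the newly generated continuation (the rewrite).
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

A fixed subtractive penalty is the simplest way to bias sampling away from suspected tokens while leaving the rest of the distribution intact; any scheme-specific estimate of which tokens carry the watermark signal could be substituted for the `suspect_ids` heuristic.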