LLM Watermark Evasion via Bias Inversion
September 27, 2025
Authors: Jeongyeon Hwang, Sangdon Park, Jungseul Ok
cs.AI
Abstract
Watermarking for large language models (LLMs) embeds a statistical signal during generation to enable detection of model-produced text. While watermarking has proven effective in benign settings, its robustness under adversarial evasion remains contested. To advance a rigorous understanding and evaluation of such vulnerabilities, we propose the Bias-Inversion Rewriting Attack (BIRA), which is theoretically motivated and model-agnostic. BIRA weakens the watermark signal by suppressing the logits of likely watermarked tokens during LLM-based rewriting, without any knowledge of the underlying watermarking scheme. Across recent watermarking methods, BIRA achieves over 99% evasion while preserving the semantic content of the original text. Beyond demonstrating an attack, our results reveal a systematic vulnerability, emphasizing the need for stress testing and robust defenses.
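To make the logit-suppression idea concrete, the sketch below rewrites a suspected-watermarked passage with an off-the-shelf causal LM while subtracting a fixed penalty from the logits of tokens drawn from that passage. This is a minimal illustration under stated assumptions, not the paper's method: the rewriter (gpt2), the penalty value, and the heuristic of treating the input's own tokens as "likely watermarked" are placeholders for BIRA's actual bias estimation, which the abstract does not specify.

```python
# Minimal sketch of bias-inversion-style rewriting (illustrative only; the
# token-suspicion heuristic below is an assumption, not BIRA's scoring rule).
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    LogitsProcessor,
    LogitsProcessorList,
)


class SuppressLikelyWatermarked(LogitsProcessor):
    """Subtract a fixed penalty from the logits of suspected token ids,
    inverting the positive bias a green-list watermark would have added."""

    def __init__(self, token_ids, penalty=2.0):
        self.token_ids = torch.tensor(sorted(set(token_ids)), dtype=torch.long)
        self.penalty = penalty

    def __call__(self, input_ids, scores):
        # scores: (batch, vocab) logits for the next token at each step.
        scores[:, self.token_ids] = scores[:, self.token_ids] - self.penalty
        return scores


tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any causal LM rewriter
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Placeholder heuristic: treat every token of the suspected-watermarked
# input as "likely watermarked" and penalize it during rewriting.
watermarked_text = "The quick brown fox jumps over the lazy dog."
suspect_ids = tokenizer(watermarked_text, add_special_tokens=False).input_ids

prompt = f"Paraphrase the following text.\nText: {watermarked_text}\nParaphrase:"
inputs = tokenizer(prompt, return_tensors="pt")

out = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,  # gpt2 defines no pad token
    logits_processor=LogitsProcessorList(
        [SuppressLikelyWatermarked(suspect_ids, penalty=2.0)]
    ),
)
# Decode only the newly generated continuation (the rewrite).
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

A fixed subtractive penalty is the simplest way to bias sampling away from suspected tokens while leaving the rest of the distribution intact; any scheme-specific estimate of which tokens carry the watermark signal could be substituted for the `suspect_ids` heuristic.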