LLM Watermark Evasion via Bias Inversion

September 27, 2025
Authors: Jeongyeon Hwang, Sangdon Park, Jungseul Ok
cs.AI

Abstract

Watermarking for large language models (LLMs) embeds a statistical signal during generation to enable detection of model-produced text. While watermarking has proven effective in benign settings, its robustness under adversarial evasion remains contested. To advance a rigorous understanding and evaluation of such vulnerabilities, we propose the Bias-Inversion Rewriting Attack (BIRA), which is theoretically motivated and model-agnostic. BIRA weakens the watermark signal by suppressing the logits of likely watermarked tokens during LLM-based rewriting, without any knowledge of the underlying watermarking scheme. Across recent watermarking methods, BIRA achieves over 99% evasion while preserving the semantic content of the original text. Beyond demonstrating an attack, our results reveal a systematic vulnerability, emphasizing the need for stress testing and robust defenses.
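
To make the logit-suppression idea concrete, here is a minimal sketch of a bias-inversion rewriting step using the Hugging Face transformers logits-processor API. It is illustrative only, not the paper's implementation: the proxy for "likely watermarked" tokens (tokens occurring in the source text), the bias strength `delta`, the `gpt2` rewriter model, and the `BiasInversionProcessor` name are all assumptions.

```python
# Illustrative sketch of bias-inversion rewriting (not the paper's code).
# Assumptions: treating tokens that appear in the watermarked source text
# as "likely watermarked", and a hand-picked bias strength `delta`.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    LogitsProcessor,
    LogitsProcessorList,
)


class BiasInversionProcessor(LogitsProcessor):
    """Subtract a fixed penalty from the logits of suspected watermarked tokens."""

    def __init__(self, suspect_token_ids, delta=2.0):
        self.suspect_token_ids = sorted(suspect_token_ids)
        self.delta = delta  # assumed bias strength, mirrors the watermark's additive bias

    def __call__(self, input_ids, scores):
        # Invert the watermark's positive bias: push suspected "green-list"
        # tokens down so the rewriter prefers unwatermarked alternatives.
        scores[:, self.suspect_token_ids] -= self.delta
        return scores


tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in rewriter model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Text suspected to carry a watermark (example input).
watermarked_text = "Large language models can embed subtle statistical signals in generated text."
# Crude proxy: tokens present in the source are more likely to be watermarked.
suspects = set(tokenizer(watermarked_text)["input_ids"])

prompt = f"Paraphrase the following text:\n{watermarked_text}\nParaphrase:"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    logits_processor=LogitsProcessorList([BiasInversionProcessor(suspects)]),
)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

A real attack would need a sharper estimate of which tokens carry the watermark, but the mechanism shown here, a negative additive bias applied to suspect tokens during rewriting, is the core idea the abstract describes.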