LLM Watermark Evasion via Bias Inversion
September 27, 2025
Authors: Jeongyeon Hwang, Sangdon Park, Jungseul Ok
cs.AI
Abstract
Watermarking for large language models (LLMs) embeds a statistical signal
during generation to enable detection of model-produced text. While
watermarking has proven effective in benign settings, its robustness under
adversarial evasion remains contested. To advance a rigorous understanding and
evaluation of such vulnerabilities, we propose the Bias-Inversion
Rewriting Attack (BIRA), which is theoretically motivated and model-agnostic.
BIRA weakens the watermark signal by suppressing the logits of likely
watermarked tokens during LLM-based rewriting, without any knowledge of the
underlying watermarking scheme. Across recent watermarking methods, BIRA
achieves over 99% evasion while preserving the semantic content of the
original text. Beyond demonstrating an attack, our results reveal a systematic
vulnerability, emphasizing the need for stress testing and robust defenses.
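The core idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `estimated_bias` values and the `bias_inversion` helper are hypothetical, standing in for whatever scheme-agnostic estimate of watermark-induced token bias BIRA computes. The sketch only shows the mechanical step the abstract names: subtracting an estimated watermark bias from the rewriter's logits so that likely-watermarked tokens are suppressed during rewriting.

```python
def bias_inversion(logits, estimated_bias, strength=1.0):
    """Suppress likely-watermarked tokens by inverting an estimated bias.

    logits:         dict mapping token -> rewriter model logit
    estimated_bias: dict mapping token -> estimated logit boost that a
                    watermark may have applied (a hypothetical estimate;
                    BIRA derives this without knowing the scheme)
    strength:       scaling factor for how aggressively to invert
    """
    return {
        tok: logit - strength * estimated_bias.get(tok, 0.0)
        for tok, logit in logits.items()
    }

# Toy example: "green" appears watermark-boosted, so its logit is lowered
# before the rewriter samples the next token.
logits = {"green": 2.0, "red": 1.5, "blue": 1.0}
estimated_bias = {"green": 1.5}
adjusted = bias_inversion(logits, estimated_bias)
```

In practice such an adjustment would run inside the rewriting model's decoding loop (e.g. as a logits processor applied at every generation step), leaving unbiased tokens untouched while flattening the statistical signal the detector relies on.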