LLM Watermark Evasion via Bias Inversion
September 27, 2025
Authors: Jeongyeon Hwang, Sangdon Park, Jungseul Ok
cs.AI
Abstract
Watermarking for large language models (LLMs) embeds a statistical signal
during generation to enable detection of model-produced text. While
watermarking has proven effective in benign settings, its robustness under
adversarial evasion remains contested. To advance a rigorous understanding and
evaluation of such vulnerabilities, we propose the Bias-Inversion
Rewriting Attack (BIRA), which is theoretically motivated and model-agnostic.
BIRA weakens the watermark signal by suppressing the logits of likely
watermarked tokens during LLM-based rewriting, without any knowledge of the
underlying watermarking scheme. Across recent watermarking methods, BIRA
achieves over 99% evasion while preserving the semantic content of the
original text. Beyond demonstrating an attack, our results reveal a systematic
vulnerability, emphasizing the need for stress testing and robust defenses.
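The core idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `estimated_bias` values and the `bias_inversion` helper are hypothetical, standing in for whatever scheme-agnostic estimate of watermark-induced token bias BIRA computes. The sketch only shows the mechanical step the abstract names: subtracting an estimated watermark bias from the rewriter's logits so that likely-watermarked tokens are suppressed during rewriting.

```python
def bias_inversion(logits, estimated_bias, strength=1.0):
    """Suppress likely-watermarked tokens by inverting an estimated bias.

    logits:         dict mapping token -> rewriter model logit
    estimated_bias: dict mapping token -> estimated logit boost that a
                    watermark may have applied (a hypothetical estimate;
                    BIRA derives this without knowing the scheme)
    strength:       scaling factor for how aggressively to invert
    """
    return {
        tok: logit - strength * estimated_bias.get(tok, 0.0)
        for tok, logit in logits.items()
    }

# Toy example: "green" appears watermark-boosted, so its logit is lowered
# before the rewriter samples the next token.
logits = {"green": 2.0, "red": 1.5, "blue": 1.0}
estimated_bias = {"green": 1.5}
adjusted = bias_inversion(logits, estimated_bias)
```

In practice such an adjustment would run inside the rewriting model's decoding loop (e.g. as a logits processor applied at every generation step), leaving unbiased tokens untouched while flattening the statistical signal the detector relies on.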