LLM Watermark Evasion via Bias Inversion

Abstract

Watermarking for large language models (LLMs) embeds a statistical signal during generation so that model-produced text can be detected. While watermarking has proven effective in benign settings, its robustness under adversarial evasion remains contested. To advance a rigorous understanding and evaluation of such vulnerabilities, we propose the Bias-Inversion Rewriting Attack (BIRA), which is theoretically motivated and model-agnostic. BIRA weakens the watermark signal by suppressing the logits of likely watermarked tokens during LLM-based rewriting, and it requires no knowledge of the underlying watermarking scheme. Across recent watermarking methods, BIRA achieves over 99% evasion while preserving the semantic content of the original text. Beyond demonstrating an attack, our results reveal a systematic vulnerability, emphasizing the need for stress testing and robust defenses.
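
The logit-suppression idea mentioned in the abstract can be illustrated with a toy example. This is only a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not say how likely watermarked tokens are identified or how strong the suppression is, so `suspect_token_ids` and `delta` are hypothetical names chosen here for illustration.

```python
import math


def suppress_logits(logits, suspect_token_ids, delta=4.0):
    """Return a copy of `logits` with `delta` subtracted at each suspect token id.

    Assumption: `suspect_token_ids` stands in for the tokens judged likely to
    carry the watermark; how they are chosen is not given in the abstract.
    """
    biased = list(logits)
    for tok in set(suspect_token_ids):
        biased[tok] -= delta
    return biased


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary of five tokens; ids 1 and 3 are assumed to be likely watermarked.
logits = [1.0, 2.0, 0.5, 2.0, 1.5]
suspects = [1, 3]

before = softmax(logits)
after = softmax(suppress_logits(logits, suspects, delta=4.0))
```

In this sketch, probability mass shifts away from the suspect tokens and toward the remaining vocabulary, which is the qualitative effect the abstract attributes to suppressing logits during rewriting.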
