PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks

Abstract

Watermarking for large language models (LLMs) is a promising approach for detecting LLM-generated text and enabling responsible deployment. However, existing watermarking methods are often vulnerable to semantic-invariant attacks, such as paraphrasing. We propose PASA, a principled, robust, and distortion-free watermarking algorithm that embeds and detects a watermark at the semantic level. PASA operates on semantic clusters in a latent embedding space and constructs a distributional dependency between token and auxiliary sequences via shared randomness synchronized by a secret key and semantic history. This design is grounded in our theoretical framework, which characterizes a jointly optimal embedding-detection pair that achieves the fundamental trade-offs among detection accuracy, robustness, and distortion. Evaluations across multiple LLMs and semantic-invariant attacks demonstrate that PASA remains robust even under strong paraphrasing attacks while preserving high text quality, outperforming standard vocabulary-space baselines. Ablation studies further validate the effectiveness of our hyperparameter choices. Webpage: https://ai-kunkun.github.io/PASA_page/.
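To make the idea of "shared randomness synchronized by a secret key and semantic history" concrete, the following is a minimal sketch and not PASA's actual construction. It assumes the semantic history can be represented as a short list of integer cluster IDs, and it uses an HMAC to derive a seed that both the embedder and the detector can recompute. The names `shared_seed`, `auxiliary_draw`, and `cluster_history` are hypothetical and do not come from the paper.

```python
import hashlib
import hmac
import random


def shared_seed(secret_key: bytes, cluster_history: list[int]) -> int:
    """Derive a 64-bit seed from a secret key and recent cluster IDs.

    Assumption: the semantic history is a list of integer cluster IDs.
    This representation is illustrative, not taken from the paper.
    """
    message = ",".join(str(c) for c in cluster_history).encode("utf-8")
    digest = hmac.new(secret_key, message, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")


def auxiliary_draw(secret_key: bytes, cluster_history: list[int]) -> float:
    """Draw a pseudo-random value in [0, 1) from the shared seed.

    Anyone holding the same key and observing the same cluster history
    reproduces the same value, which is what lets a detector resynchronize.
    """
    rng = random.Random(shared_seed(secret_key, cluster_history))
    return rng.random()
```

Because the seed in this sketch depends on cluster IDs rather than exact tokens, a paraphrase that stays within the same clusters would leave the derived randomness unchanged. That is the intuition the abstract appeals to; how PASA itself defines the clusters and the dependency is specified in the paper, not here.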
