PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks
May 9, 2026
Authors: Zhenxin Ai, Haiyun He
cs.AI
Abstract
Watermarking for large language models (LLMs) is a promising approach for detecting LLM-generated text and enabling responsible deployment. However, existing watermarking methods are often vulnerable to semantic-invariant attacks, such as paraphrasing. We propose PASA, a principled, robust, and distortion-free watermarking algorithm that embeds and detects a watermark at the semantic level. PASA operates on semantic clusters in a latent embedding space and constructs a distributional dependency between token and auxiliary sequences via shared randomness synchronized by a secret key and semantic history. This design is grounded in our theoretical framework that characterizes a jointly optimal embedding-detection pair, achieving the fundamental trade-offs among detection accuracy, robustness, and distortion. Evaluations across multiple LLMs and semantic-invariant attacks demonstrate that PASA remains robust even under strong paraphrasing attacks while preserving high text quality, outperforming standard vocabulary-space baselines. Ablation studies further validate the effectiveness of our hyperparameter choices. Webpage: https://ai-kunkun.github.io/PASA_page/.
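The idea of shared randomness synchronized by a secret key and semantic history can be illustrated with a toy sketch. This is not the authors' algorithm: the nearest-centroid clustering, the HMAC-based seed derivation, and the names `nearest_cluster` and `shared_uniform` are all assumptions made for illustration, and the 2-D "embeddings" are made up. The sketch only shows why keying the randomness on semantic cluster ids, rather than on exact tokens, could let a detector recompute the same value after a paraphrase that stays within the same clusters.

```python
import hashlib
import hmac
import struct


def nearest_cluster(embedding, centroids):
    """Return the index of the centroid closest to `embedding` (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: sq_dist(embedding, centroids[i]))


def shared_uniform(secret_key, cluster_history):
    """Derive a deterministic value in [0, 1) from a key and recent cluster ids.

    Hypothetical helper: HMAC-SHA256 stands in for whatever keyed
    pseudo-random function a real scheme would use.
    """
    message = b"".join(struct.pack(">I", c) for c in cluster_history)
    digest = hmac.new(secret_key, message, hashlib.sha256).digest()
    # Keep the top 53 bits so the result is an exact float strictly below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


# Made-up centroids and embeddings; a real system would use a sentence encoder.
centroids = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
original = [(0.9, 1.1), (-0.8, 0.9)]      # embeddings of the generated text
paraphrase = [(1.1, 0.8), (-1.2, 1.1)]    # slightly shifted by rewording

key = b"demo-secret-key"
hist_original = [nearest_cluster(e, centroids) for e in original]
hist_paraphrase = [nearest_cluster(e, centroids) for e in paraphrase]

u_embed = shared_uniform(key, hist_original)     # used at generation time
u_detect = shared_uniform(key, hist_paraphrase)  # recomputed at detection time
print(hist_original, hist_paraphrase, u_embed == u_detect)
```

In this toy setting both histories map to the same cluster ids, so the embedder and the detector obtain the same pseudo-random value even though the underlying embeddings differ. A paraphrase that moves a segment into a different cluster would break the synchronization, which is one reason the choice of clustering matters in any scheme of this kind.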