PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks

Abstract

Watermarking for large language models (LLMs) is a promising approach for detecting LLM-generated text and enabling responsible deployment. However, existing watermarking methods are often vulnerable to semantic-invariant attacks such as paraphrasing. We propose PASA, a principled, robust, and distortion-free watermarking algorithm that embeds and detects a watermark at the semantic level. PASA operates on semantic clusters in a latent embedding space and constructs a distributional dependency between the token sequence and an auxiliary sequence via shared randomness synchronized by a secret key and the semantic history. This design is grounded in our theoretical framework, which characterizes a jointly optimal embedding-detection pair that achieves the fundamental trade-offs among detection accuracy, robustness, and distortion. Evaluations across multiple LLMs and semantic-invariant attacks show that PASA remains robust even under strong paraphrasing attacks while preserving high text quality, outperforming standard vocabulary-space baselines. Ablation studies further validate the effectiveness of our hyperparameter choices. Webpage: https://ai-kunkun.github.io/PASA_page/.
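
As a rough illustration of what "shared randomness synchronized by a secret key and semantic history" could look like, the sketch below derives a pseudorandom seed from a key and the most recent semantic cluster IDs. It is a minimal sketch under stated assumptions, not PASA's actual construction: the function names `shared_seed` and `auxiliary_cluster`, the HMAC-SHA256 derivation, the `window` parameter, and the representation of semantic history as a list of integer cluster IDs are all hypothetical choices made for illustration. It covers only the synchronization step, not the embedding or detection rule.

```python
import hashlib
import hmac
import random
from typing import Sequence


def shared_seed(secret_key: bytes, cluster_history: Sequence[int], window: int = 3) -> int:
    """Derive a seed from the key and the last `window` cluster IDs (hypothetical scheme)."""
    # Only the most recent cluster IDs are used, so the seed depends on
    # semantic context rather than on the exact tokens that were emitted.
    recent = list(cluster_history)[-window:]
    message = ",".join(str(c) for c in recent).encode("utf-8")
    digest = hmac.new(secret_key, message, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")


def auxiliary_cluster(secret_key: bytes, cluster_history: Sequence[int],
                      num_clusters: int, window: int = 3) -> int:
    """Sample an auxiliary cluster index from the shared seed (illustrative only)."""
    rng = random.Random(shared_seed(secret_key, cluster_history, window))
    return rng.randrange(num_clusters)
```

Under these assumptions, an embedder and a detector holding the same key recompute the same auxiliary value whenever the recent cluster IDs agree. A paraphrase that changes tokens but keeps the text in the same semantic clusters would therefore leave the shared randomness intact, which is the intuition behind watermarking at the semantic level rather than the vocabulary level.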

