PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks

Abstract

Watermarking for large language models (LLMs) is a promising approach for detecting LLM-generated text and enabling responsible deployment. However, existing watermarking methods are often vulnerable to semantic-invariant attacks, such as paraphrasing. We propose PASA, a principled, robust, and distortion-free watermarking algorithm that embeds and detects a watermark at the semantic level. PASA operates on semantic clusters in a latent embedding space and constructs a distributional dependency between token and auxiliary sequences via shared randomness synchronized by a secret key and semantic history. This design is grounded in our theoretical framework, which characterizes a jointly optimal embedding-detection pair that achieves the fundamental trade-offs among detection accuracy, robustness, and distortion. Evaluations across multiple LLMs and semantic-invariant attacks demonstrate that PASA remains robust even under strong paraphrasing attacks while preserving high text quality, outperforming standard vocabulary-space baselines. Ablation studies further validate the effectiveness of our hyperparameter choices. Webpage: https://ai-kunkun.github.io/PASA_page/
