Robust Distortion-free Watermarks for Language Models

Abstract

We propose a methodology for planting watermarks in text from an autoregressive language model that are robust to perturbations without changing the distribution over text up to a certain maximum generation budget. We generate watermarked text by mapping a sequence of random numbers, which we compute using a randomized watermark key, to a sample from the language model. To detect watermarked text, any party who knows the key can align the text to the random number sequence. We instantiate our watermark methodology with two sampling schemes: inverse transform sampling and exponential minimum sampling. We apply these watermarks to three language models (OPT-1.3B, LLaMA-7B and Alpaca-7B) to experimentally validate their statistical power and robustness to various paraphrasing attacks. Notably, for both the OPT-1.3B and LLaMA-7B models, we find we can reliably detect watermarked text (p ≤ 0.01) from as few as 35 tokens, even after corrupting 40-50% of the tokens via random edits (i.e., substitutions, insertions or deletions). For the Alpaca-7B model, we conduct a case study on the feasibility of watermarking responses to typical user instructions. Due to the lower entropy of the responses, detection is more difficult: around 25% of the responses, whose median length is around 100 tokens, are detectable with p ≤ 0.01, and the watermark is also less robust to certain automated paraphrasing attacks we implement.
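To make the idea of mapping key-derived random numbers to a sample concrete, the following is a minimal Python sketch of exponential minimum sampling with a simplified detection test. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`key_sequence`, `exp_min_sample`, `generate`, `statistic`, `p_value`) are hypothetical, a fixed toy next-token distribution stands in for a language model, and the detector assumes the text has not been edited, so it omits the alignment step that gives robustness to insertions and deletions.

```python
import numpy as np


def key_sequence(key: int, length: int, vocab_size: int) -> np.ndarray:
    """Hypothetical helper: derive a (length x vocab_size) array of uniform
    random numbers in [0, 1) from an integer watermark key."""
    return np.random.default_rng(key).random((length, vocab_size))


def exp_min_sample(probs: np.ndarray, xi_t: np.ndarray) -> int:
    """Return argmin_i -log(xi_i) / p_i.

    -log(xi_i) is a standard exponential variable, so -log(xi_i) / p_i is
    exponential with rate p_i, and the index of the minimum is distributed
    according to probs. The token distribution is therefore unchanged.
    """
    with np.errstate(divide="ignore"):
        scores = -np.log(xi_t) / probs
    return int(np.argmin(scores))


def generate(probs: np.ndarray, key: int, length: int) -> list:
    """Generate `length` tokens from a fixed toy distribution (assumption:
    a real model would supply a new `probs` at every step)."""
    xi = key_sequence(key, length, len(probs))
    return [exp_min_sample(probs, xi[t]) for t in range(length)]


def statistic(tokens: list, xi: np.ndarray) -> float:
    """Sum of log(1 - xi[t, token_t]); watermarked text tends to pick tokens
    with large xi values, so lower values indicate better agreement."""
    idx = np.arange(len(tokens))
    return float(np.sum(np.log1p(-xi[idx, tokens])))


def p_value(tokens: list, key: int, vocab_size: int,
            n_null: int = 999, seed: int = 12345) -> float:
    """Monte Carlo p-value: compare the statistic under the claimed key with
    statistics computed under freshly drawn random sequences."""
    observed = statistic(tokens, key_sequence(key, len(tokens), vocab_size))
    rng = np.random.default_rng(seed)
    null = [statistic(tokens, rng.random((len(tokens), vocab_size)))
            for _ in range(n_null)]
    return (1 + sum(s <= observed for s in null)) / (1 + n_null)
```

As a usage example, `generate(np.full(50, 1 / 50), key=42, length=50)` produces 50 tokens from a uniform toy distribution, and `p_value(tokens, key=42, vocab_size=50)` then tests them against the same key. A lower-entropy `probs` leaves less room for the random numbers to influence the choice of token, which is consistent with the abstract's observation that low-entropy responses are harder to detect.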

