Robust Distortion-free Watermarks for Language Models

Abstract

We propose a methodology for planting watermarks in text from an autoregressive language model that are robust to perturbations without changing the distribution over text up to a certain maximum generation budget. We generate watermarked text by mapping a sequence of random numbers -- which we compute using a randomized watermark key -- to a sample from the language model. To detect watermarked text, any party who knows the key can align the text to the random number sequence. We instantiate our watermark methodology with two sampling schemes: inverse transform sampling and exponential minimum sampling. We apply these watermarks to three language models -- OPT-1.3B, LLaMA-7B and Alpaca-7B -- to experimentally validate their statistical power and robustness to various paraphrasing attacks. Notably, for both the OPT-1.3B and LLaMA-7B models, we find we can reliably detect watermarked text (p ≤ 0.01) from 35 tokens even after corrupting 40-50% of the tokens via random edits (i.e., substitutions, insertions or deletions). For the Alpaca-7B model, we conduct a case study on the feasibility of watermarking responses to typical user instructions. Due to the lower entropy of the responses, detection is more difficult: around 25% of the responses -- whose median length is around 100 tokens -- are detectable with p ≤ 0.01, and the watermark is also less robust to certain automated paraphrasing attacks we implement.
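To make the generate-then-align idea concrete, here is a minimal Python sketch of one of the two schemes named above, exponential minimum sampling, together with a simplified detection test. Everything in it is an illustrative assumption rather than the authors' implementation: `toy_model` is a hypothetical stand-in for a language model, `key_sequence` derives the random numbers from an integer key using Python's built-in generator, and the test statistic only searches over cyclic offsets into the key sequence. Handling insertions and deletions would need a more flexible alignment than the shift used here.

```python
import math
import random

def key_sequence(key, n, vocab_size):
    """Expand an integer key into n rows of vocab_size uniforms in [0, 1)."""
    rng = random.Random(key)
    return [[rng.random() for _ in range(vocab_size)] for _ in range(n)]

def exp_min_sample(probs, xi_row):
    """Return argmin_i of -log(xi_i) / p_i.

    For xi_i i.i.d. uniform, this index is distributed exactly as probs,
    so averaging over keys leaves the model's distribution unchanged.
    """
    best, best_score = -1, math.inf
    for i, (p, u) in enumerate(zip(probs, xi_row)):
        if p <= 0.0:
            continue
        score = -math.log(max(u, 1e-300)) / p  # guard against u == 0.0
        if score < best_score:
            best, best_score = i, score
    return best

def toy_model(prefix, vocab_size):
    """Hypothetical next-token distribution that depends on the last token."""
    last = prefix[-1] if prefix else 0
    weights = [1 + (7 * i + 3 * last) % 5 for i in range(vocab_size)]
    total = sum(weights)
    return [w / total for w in weights]

def generate(key, m, n, vocab_size):
    """Generate m tokens driven by the key; n is the generation budget."""
    if m > n:
        raise ValueError("generation budget exceeded")
    xi = key_sequence(key, n, vocab_size)
    tokens = []
    for t in range(m):
        tokens.append(exp_min_sample(toy_model(tokens, vocab_size), xi[t]))
    return tokens

def statistic(tokens, xi):
    """Minimum alignment cost over cyclic offsets into the key sequence.

    Watermarked tokens sit where xi is close to 1, so log(1 - xi) is
    strongly negative when the text is aligned with the right key.
    """
    n = len(xi)
    return min(
        sum(math.log(1.0 - xi[(off + t) % n][tok]) for t, tok in enumerate(tokens))
        for off in range(n)
    )

def p_value(tokens, key, n, vocab_size, trials=99, seed=0):
    """Compare the statistic under the real key with freshly drawn keys."""
    observed = statistic(tokens, key_sequence(key, n, vocab_size))
    rng = random.Random(seed)
    hits = sum(
        statistic(tokens, key_sequence(rng.getrandbits(64), n, vocab_size)) <= observed
        for _ in range(trials)
    )
    return (1 + hits) / (trials + 1)

if __name__ == "__main__":
    text = generate(key=42, m=40, n=64, vocab_size=20)
    print("p-value with the correct key:", p_value(text, key=42, n=64, vocab_size=20))
```

The detector needs only the key and the token sequence, not the model, which matches the abstract's description that any party who knows the key can run the test. The resampling-based p-value is bounded below by 1 / (trials + 1), so more trials are needed to report smaller p-values.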
