Watermarking Makes Language Models Radioactive

Abstract

This paper investigates the radioactivity of LLM-generated texts, i.e., whether it is possible to detect that such text was used as training data. Conventional methods such as membership inference can carry out this detection with some level of accuracy. We show that watermarked training data leaves traces that are easier to detect and much more reliable than those exploited by membership inference. We link the contamination level to the robustness of the watermark, the proportion of watermarked text in the training set, and the fine-tuning process. Notably, we demonstrate that training on watermarked synthetic instructions can be detected with high confidence (p-value < 1e-5) even when as little as 5% of the training text is watermarked. Thus, LLM watermarking, originally designed for detecting machine-generated text, makes it easy to identify whether the outputs of a watermarked LLM were used to fine-tune another LLM.
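To make the "p-value < 1e-5" criterion concrete, the sketch below shows one common way such a p-value can arise. It is an illustration under stated assumptions, not the paper's actual test: the abstract does not specify the watermarking scheme, so we assume a hypothetical "green-list" style watermark in which, absent any watermark exposure, each scored token falls in the green list with probability `gamma`. The function name, `gamma`, and the token counts are all hypothetical.

```python
import math


def binomial_tail_p_value(green_count: int, total: int, gamma: float) -> float:
    """One-sided p-value P(X >= green_count) for X ~ Binomial(total, gamma).

    Assumption (not from the abstract): under the null hypothesis that the
    suspect model never saw watermarked text, each scored token is "green"
    independently with probability gamma.
    """
    if not 0 <= green_count <= total:
        raise ValueError("green_count must lie in [0, total]")
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    if green_count == 0:
        return 1.0  # every outcome is at least as extreme as zero greens

    # Sum the binomial tail in log space to avoid underflow for large totals.
    log_terms = [
        math.lgamma(total + 1)
        - math.lgamma(i + 1)
        - math.lgamma(total - i + 1)
        + i * math.log(gamma)
        + (total - i) * math.log1p(-gamma)
        for i in range(green_count, total + 1)
    ]
    peak = max(log_terms)
    tail = math.exp(peak) * sum(math.exp(t - peak) for t in log_terms)
    return min(1.0, tail)


# Hypothetical counts: 1000 scored tokens from the suspect model, gamma = 0.25.
p_suspicious = binomial_tail_p_value(320, 1000, 0.25)  # far above the 250 expected
p_ordinary = binomial_tail_p_value(260, 1000, 0.25)    # close to the 250 expected
print(p_suspicious < 1e-5, p_ordinary < 1e-5)
```

With these made-up counts, an excess of 70 green tokens over the 250 expected by chance is enough to cross the 1e-5 threshold, while an excess of 10 is not. In this simplified picture, a smaller share of watermarked training text would be expected to produce a smaller excess, which would then need more scored tokens to reach the same confidence.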
