Watermarking Makes Language Models Radioactive

Abstract

This paper investigates the radioactivity of LLM-generated texts, i.e., whether it is possible to detect that such texts were used as training data. Conventional methods such as membership inference can carry out this detection with some level of accuracy. We show that watermarked training data leaves traces that are easier to detect and much more reliable than those exploited by membership inference. We link the contamination level to the watermark's robustness, its proportion in the training set, and the fine-tuning process. Notably, we demonstrate that training on watermarked synthetic instructions can be detected with high confidence (p-value < 1e-5) even when as little as 5% of the training text is watermarked. Thus, LLM watermarking, originally designed for detecting machine-generated text, makes it easy to identify whether the outputs of a watermarked LLM were used to fine-tune another LLM.
