Watermarking Degrades Alignment in Language Models: Analysis and Mitigation

Abstract

Watermarking techniques for large language models (LLMs) can significantly impact output quality, yet their effects on truthfulness, safety, and helpfulness remain underexamined. This paper presents a systematic analysis of how two popular watermarking approaches, Gumbel and KGW, affect these core alignment properties across four aligned LLMs. Our experiments reveal two distinct degradation patterns: guard attenuation, where enhanced helpfulness undermines model safety, and guard amplification, where excessive caution reduces model helpfulness. These patterns emerge from watermark-induced shifts in the token distribution and expose a fundamental tension between alignment objectives. To mitigate these degradations, we propose Alignment Resampling (AR), an inference-time sampling method that uses an external reward model to restore alignment. We establish a theoretical lower bound on the improvement in expected reward score as the sample size increases, and we empirically demonstrate that sampling just 2 to 4 watermarked generations effectively recovers or surpasses baseline (unwatermarked) alignment scores. To overcome the limited response diversity of standard Gumbel watermarking, our modified implementation sacrifices strict distortion-freeness while maintaining robust detectability, ensuring compatibility with AR. Experimental results confirm that AR successfully recovers baseline alignment for both watermarking approaches while maintaining strong watermark detectability. This work reveals the critical balance between watermark strength and model alignment and provides a simple inference-time solution for responsibly deploying watermarked LLMs in practice.
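
As an illustration of the resampling idea described above, the following minimal sketch draws n watermarked generations and keeps the one an external reward model scores highest. It assumes a best-of-n selection rule, which the abstract does not spell out. The names `alignment_resample`, `generate`, and `reward` are hypothetical placeholders for a watermarked sampler and a reward model. They are not an API from the paper or from any library.

```python
from typing import Callable, List, Tuple


def alignment_resample(
    prompt: str,
    generate: Callable[[str], str],       # hypothetical: one watermarked sample per call
    reward: Callable[[str, str], float],  # hypothetical: external reward model score
    n: int = 4,
) -> Tuple[str, float]:
    """Draw n watermarked candidates and return the highest-reward one.

    Assumption: selection is plain best-of-n. Every candidate comes from
    the watermarked sampler, so the returned text is still watermarked.
    """
    if n < 1:
        raise ValueError("n must be at least 1")

    # Sample n candidate responses from the watermarked model.
    candidates: List[str] = [generate(prompt) for _ in range(n)]

    # Score each candidate with the external reward model.
    scores: List[float] = [reward(prompt, c) for c in candidates]

    # Keep the candidate with the highest score (first one wins on ties).
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]
```

With n = 1 the function reduces to ordinary watermarked sampling. The abstract reports that 2 to 4 samples were enough in the authors' experiments, so a small n is the natural setting for this sketch.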
