Watermarking Degrades Alignment in Language Models: Analysis and Mitigation
June 4, 2025
Authors: Apurv Verma, NhatHai Phan, Shubhendu Trivedi
cs.AI
Abstract
Watermarking techniques for large language models (LLMs) can significantly impact output quality, yet their effects on truthfulness, safety, and helpfulness remain critically underexamined. This paper presents a systematic analysis of how two popular watermarking approaches, Gumbel and KGW, affect these core alignment properties across four aligned LLMs. Our experiments reveal two distinct degradation patterns: guard attenuation, where enhanced helpfulness undermines model safety, and guard amplification, where excessive caution reduces model helpfulness. These patterns emerge from watermark-induced shifts in the token distribution, surfacing a fundamental tension between alignment objectives.
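To make the watermark-induced distribution shift concrete, the sketch below shows a KGW-style greenlist bias applied at a single decoding step. The hashing scheme, greenlist fraction gamma, and bias strength delta are illustrative defaults, not the configuration used in the paper.

```python
import torch

def kgw_bias_logits(logits: torch.Tensor, prev_token_id: int,
                    gamma: float = 0.25, delta: float = 2.0,
                    key: int = 15485863) -> torch.Tensor:
    """Add a bias delta to a pseudorandom 'green' subset of the vocabulary.

    `logits` is a 1-D tensor of next-token logits. The greenlist is
    seeded by the previous token, so generated text over-represents
    green tokens (detectable via a z-test on greenlist hits), while
    the aligned model's next-token distribution is perturbed.
    """
    vocab_size = logits.shape[-1]
    gen = torch.Generator().manual_seed(key * (prev_token_id + 1))
    perm = torch.randperm(vocab_size, generator=gen)
    green = perm[: int(gamma * vocab_size)]
    biased = logits.clone()
    biased[green] += delta  # larger delta: stronger watermark, bigger shift
    return biased
```

Detection counts how many sampled tokens fall in their contexts' greenlists; raising delta strengthens that statistical signal but also moves probability mass away from the tokens the aligned model preferred, which is the kind of shift the degradation patterns above trace back to.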
To mitigate these degradations, we propose Alignment Resampling (AR), an
inference-time sampling method that uses an external reward model to restore
alignment. We establish a theoretical lower bound on the improvement in expected reward score as the sample size increases, and we empirically demonstrate that sampling just 2-4 watermarked generations effectively recovers or surpasses baseline (unwatermarked) alignment scores. To overcome the limited
response diversity of standard Gumbel watermarking, our modified implementation
sacrifices strict distortion-freeness while maintaining robust detectability,
ensuring compatibility with AR. Experimental results confirm that AR
successfully recovers baseline alignment in both watermarking approaches, while
maintaining strong watermark detectability. This work reveals the critical
balance between watermark strength and model alignment, providing a simple
inference-time solution to responsibly deploy watermarked LLMs in practice.
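As a rough illustration, Alignment Resampling amounts to best-of-n sampling scored by an external reward model. In the sketch below, generate_watermarked and reward_model are hypothetical callables standing in for a watermarked decoder and a scalar reward model; this is a minimal sketch, not the authors' implementation. It also shows why candidate diversity matters: with an unmodified Gumbel watermark all n generations can be identical, which is why the modified implementation above trades strict distortion-freeness for diversity.

```python
def alignment_resample(prompt, generate_watermarked, reward_model, n=4):
    """Draw n watermarked generations and return the one the external
    reward model scores highest (best-of-n resampling).

    n = 2-4 matches the sample sizes the abstract reports as sufficient
    to recover baseline alignment.
    """
    candidates = [generate_watermarked(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best]
```

The logarithmic flavor of the theoretical bound is intuitive: if the n reward scores behave like i.i.d. (sub-)Gaussian draws with standard deviation sigma, the expected best-of-n score exceeds the mean by on the order of sigma * sqrt(2 log n), so small n already captures most of the achievable gain. This standard extreme-value argument is offered as intuition, not as the paper's exact statement.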