Watermarking Degrades Alignment in Language Models: Analysis and Mitigation

Abstract

Watermarking techniques for large language models (LLMs) can significantly affect output quality, yet their effects on truthfulness, safety, and helpfulness remain underexamined. This paper presents a systematic analysis of how two popular watermarking approaches, Gumbel and KGW, affect these core alignment properties across four aligned LLMs. Our experiments reveal two distinct degradation patterns: guard attenuation, where enhanced helpfulness undermines model safety, and guard amplification, where excessive caution reduces model helpfulness. These patterns emerge from watermark-induced shifts in the token distribution and expose a fundamental tension between alignment objectives. To mitigate these degradations, we propose Alignment Resampling (AR), an inference-time sampling method that uses an external reward model to restore alignment. We establish a theoretical lower bound on the improvement in expected reward score as the sample size increases, and we empirically demonstrate that sampling just 2 to 4 watermarked generations recovers or surpasses baseline (unwatermarked) alignment scores. To overcome the limited response diversity of standard Gumbel watermarking, our modified implementation sacrifices strict distortion-freeness while maintaining robust detectability, which keeps it compatible with AR. Experimental results confirm that AR recovers baseline alignment for both watermarking approaches while maintaining strong watermark detectability. This work reveals the critical balance between watermark strength and model alignment and provides a simple inference-time solution for deploying watermarked LLMs responsibly in practice.
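The resampling idea can be illustrated with a minimal sketch. It assumes that AR amounts to best-of-n selection: draw n watermarked candidates, score each with the external reward model, and keep the highest-scoring one. The abstract does not spell out the selection rule, so this is an assumption, and the names alignment_resample, generate_watermarked, reward, toy_generator, and toy_reward are hypothetical placeholders rather than the authors' code.

```python
import random
from typing import Callable, List, Tuple


def alignment_resample(
    prompt: str,
    generate_watermarked: Callable[[str], str],  # hypothetical: returns one watermarked response
    reward: Callable[[str, str], float],         # hypothetical: external reward model score
    n: int = 4,
) -> Tuple[str, List[float]]:
    """Draw n watermarked candidates and return the highest-reward one.

    Assumption: best-of-n selection stands in for Alignment Resampling here.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate_watermarked(prompt) for _ in range(n)]
    scores = [reward(prompt, c) for c in candidates]
    # Index of the best-scoring candidate; ties resolve to the earliest draw.
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores


def toy_generator(seed: int = 0) -> Callable[[str], str]:
    """Toy stand-in for a watermarked LLM: picks a canned reply at random."""
    rng = random.Random(seed)
    replies = [
        "I cannot help with that.",
        "Here is a careful, sourced answer.",
        "Sure.",
    ]
    return lambda prompt: rng.choice(replies)


def toy_reward(prompt: str, response: str) -> float:
    """Toy stand-in for a reward model: longer replies score higher."""
    return float(len(response))


if __name__ == "__main__":
    best, scores = alignment_resample(
        "Explain watermarking.", toy_generator(), toy_reward, n=4
    )
    print(best, scores)
```

Because every candidate is itself a watermarked generation, the selected response still carries the watermark. Selection only helps when the candidates differ from one another, which is consistent with the abstract's remark that standard Gumbel watermarking needs modification to provide enough response diversity for AR.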
