Linear Ensembles Remove Watermarks: On the Fragility of Distributional Perturbations in Large Language Models

Abstract

Watermarking embeds statistical signatures in AI-generated text to enable detection and attribution. We reveal a fundamental vulnerability: when users have access to multiple models, as they do today, watermarks fail trivially. A watermark perturbs a model's output distribution away from the original, and in a competitive market these perturbations are typically independent across providers. We prove theoretically that averaging the output probability distributions recovers the unwatermarked distribution up to a second-order error term. Empirically, averaging just 3-5 models is enough to cancel the perturbations. We introduce WASH (Watermark Attenuation via Statistical Hybridisation), which solves the practical challenges of ensemble generation across heterogeneous models: vocabulary misalignment and tokenisation differences. Experiments across six watermarking schemes and three LLMs show that averaging 3 models suppresses detection z-scores from 5-300 to below 2 (under the detection threshold of 4) and reduces TPR at 5% FPR to below 50%, while improving generation quality by 27.5% and running 6 times faster than the best baseline on long-sequence generation. Our results suggest that robust AI-text detection via watermarking requires either accepting this fundamental vulnerability or unprecedented coordination among model providers.
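
The averaging argument can be illustrated with a small numerical sketch. It is not the WASH implementation. It makes three assumptions: a KGW-style green-list logit bias stands in for the watermark, every provider shares the same clean next-token distribution `p`, and each provider draws its green list independently. The names `watermark`, `total_variation`, `delta`, and `green_masks` are hypothetical and chosen only for this example. Under these assumptions, the average of the watermarked distributions moves back toward `p` as more providers are included, and the probability mass on any single provider's green list falls toward its unwatermarked level.

```python
import numpy as np


def watermark(p, green_mask, delta):
    """Assumed KGW-style soft watermark: add `delta` to green-token logits."""
    q = p * np.exp(delta * green_mask)
    return q / q.sum()


def total_variation(a, b):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(a - b).sum()


rng = np.random.default_rng(0)
vocab_size, delta, num_models = 1000, 2.0, 5

# Hypothetical clean next-token distribution, shared by all providers
# (a simplification; real models differ from one another).
p = rng.dirichlet(np.ones(vocab_size))

# Each provider draws an independent green list covering about half the vocabulary.
green_masks = [rng.random(vocab_size) < 0.5 for _ in range(num_models)]
watermarked = [watermark(p, g, delta) for g in green_masks]

tv_by_k = []          # distance of the k-model average from the clean distribution
green_mass_by_k = []  # mass the k-model average puts on provider 1's green list
for k in range(1, num_models + 1):
    q_bar = np.mean(watermarked[:k], axis=0)  # linear ensemble of k models
    tv_by_k.append(total_variation(q_bar, p))
    green_mass_by_k.append(q_bar[green_masks[0]].sum())
    print(f"k={k}: TV to clean = {tv_by_k[-1]:.3f}, "
          f"green mass (provider 1) = {green_mass_by_k[-1]:.3f}")
```

A detector keyed to provider 1 scores text by how often tokens fall in that provider's green list, so the shrinking green mass in the second column corresponds to a shrinking detection signal. With a finite number of models the residual perturbation is not exactly zero in this sketch; it only decreases as independent perturbations are averaged.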