Linear Ensembles Wash Away Watermarks: On the Vulnerability of Distribution Perturbations in LLMs

Abstract

Watermarking embeds statistical signatures in AI-generated text to enable detection and attribution. We reveal a fundamental vulnerability: when users have access to multiple models, as is the case today, watermarks are trivially defeated. Watermarks perturb a model's output distribution away from the original, and in competitive markets these perturbations are typically independent across providers. We prove theoretically that averaging output probability distributions recovers the unwatermarked distribution up to a second-order error term. Empirically, simply averaging 3-5 models cancels out these perturbations. We introduce WASH (Watermark Attenuation via Statistical Hybridisation), which solves practical challenges in ensemble generation, namely vocabulary misalignment and tokenisation differences across heterogeneous models. Experiments across six watermarking schemes and three LLMs show that averaging across 3 models suppresses detection z-scores from 5-300 to below 2 (under the detection threshold of 4) and reduces TPR at 5% FPR to below 50%, while improving quality by 27.5% and running 6 times faster than the best baseline on long-sequence generation. Our results suggest that robust AI-text detection via watermarking requires either accepting this fundamental vulnerability or unprecedented coordination among model providers.
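
The cancellation intuition can be illustrated with a small numerical sketch. This is not WASH itself and does not reproduce the paper's experiments. It assumes, purely for illustration, that every model shares the same clean next-token distribution and that each provider applies a green-list style logit bias with an independently drawn green list. The names `watermarked_probs` and `mean_tv_to_clean` and the parameters `delta`, `gamma`, and `vocab` are hypothetical choices made for this example.

```python
import numpy as np


def softmax(logits):
    # Numerically stable softmax over a 1-D logit vector.
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def watermarked_probs(logits, rng, delta=2.0, gamma=0.5):
    # Assumed watermark: add `delta` to the logits of a random "green"
    # subset covering a fraction `gamma` of the vocabulary.
    vocab = logits.shape[0]
    green = rng.permutation(vocab)[: int(gamma * vocab)]
    biased = logits.copy()
    biased[green] += delta
    return softmax(biased)


def mean_tv_to_clean(num_models, trials=200, vocab=1000, seed=0):
    # Average total variation distance between the clean distribution and
    # the mean of `num_models` independently watermarked distributions.
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        logits = rng.normal(size=vocab)  # synthetic clean logits (assumption)
        clean = softmax(logits)
        ensemble = np.mean(
            [watermarked_probs(logits, rng) for _ in range(num_models)], axis=0
        )
        total += 0.5 * np.abs(ensemble - clean).sum()
    return total / trials


# Compare a single watermarked model with ensembles of 3 and 5 models.
for k in (1, 3, 5):
    print(k, round(mean_tv_to_clean(k), 3))
```

Under these assumptions the distance to the clean distribution shrinks as more independently watermarked models are averaged. Real providers serve different base models with different vocabularies and tokenisers, which is the practical gap that WASH is described as addressing.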