Cosine Is Misleading: The Auxiliary Loss Reshapes the Vision-Language Model, Not Its Latents

Abstract

Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vision-language models (VLMs). The field uses the alignment between these latents and their visual targets, measured by cosine similarity or mean squared error (MSE), as both the training loss and the quality metric, on the assumption that better alignment yields better answers. We test this assumption with a designed matrix of five LVR variants and find it inverted: across the five variants, cosine alignment is negatively correlated with accuracy (r = -0.94). To explain this result, we introduce PRISM, a pair of inference-time diagnostics: a linear probe that asks where the answer is decodable, and a corruption test that asks whether the latent is load-bearing. The supervised latents are largely bypassed: corrupting them shifts accuracy by at most four points. The answer is decodable downstream of the latent but not at the latent position itself, and the size of this decodability gap predicts how much each variant relies on its latent under perturbation. Consistent with an Information Bottleneck reading of the loss, the auxiliary objective reshapes the language model through shared parameters rather than through the latent variable it nominally optimizes.
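The quantities named in the abstract (cosine alignment, its correlation with accuracy, the probe-based decodability gap, and the accuracy shift under latent corruption) can be illustrated with a minimal NumPy sketch. This is not the PRISM implementation. Every function name here is hypothetical, the probe is assumed to be a ridge-regularized least-squares classifier, and the corruption is assumed to be additive Gaussian noise; the abstract specifies neither choice.

```python
import numpy as np


def cosine_alignment(latents, targets):
    """Mean cosine similarity between latents and their visual targets."""
    num = np.sum(latents * targets, axis=-1)
    den = np.linalg.norm(latents, axis=-1) * np.linalg.norm(targets, axis=-1)
    return float(np.mean(num / np.maximum(den, 1e-12)))


def pearson_r(x, y):
    """Pearson correlation, e.g. per-variant alignment vs. per-variant accuracy."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2)))


def probe_accuracy(train_x, train_y, test_x, test_y, n_classes, ridge=1e-3):
    """Linear probe (assumed form): ridge regression onto one-hot answers."""
    onehot = np.eye(n_classes)[train_y]
    xb = np.hstack([train_x, np.ones((len(train_x), 1))])  # add bias column
    gram = xb.T @ xb + ridge * np.eye(xb.shape[1])
    w = np.linalg.solve(gram, xb.T @ onehot)
    tb = np.hstack([test_x, np.ones((len(test_x), 1))])
    return float(np.mean(np.argmax(tb @ w, axis=1) == test_y))


def decodability_gap(h_latent, h_downstream, labels, n_classes, n_train):
    """Probe accuracy downstream of the latent minus probe accuracy at it."""
    def acc(h):
        return probe_accuracy(h[:n_train], labels[:n_train],
                              h[n_train:], labels[n_train:], n_classes)
    return acc(h_downstream) - acc(h_latent)


def corruption_shift(predict_fn, inputs, latents, labels, rng, scale=1.0):
    """Accuracy change when latents are perturbed with Gaussian noise.

    predict_fn(inputs, latents) is a hypothetical stand-in for running the
    model's answer head with the given latents; it returns predicted labels.
    """
    clean = np.mean(predict_fn(inputs, latents) == labels)
    noisy = latents + scale * rng.standard_normal(latents.shape)
    corrupted = np.mean(predict_fn(inputs, noisy) == labels)
    return float(corrupted - clean)
```

In this sketch, a model that bypasses its latents would show a corruption shift near zero together with a large positive decodability gap, which is the pattern the abstract reports for the supervised latents.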