Cosine Misleads: The Auxiliary Loss Reshapes the Vision-Language Model, Not Its Latents

Abstract

Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vision-language models (VLMs). The field uses the alignment between these latents and their visual targets, measured by cosine similarity or mean squared error (MSE), as both the training loss and the quality metric, on the assumption that better alignment yields better answers. We test this assumption with a design matrix of five LVR variants and find it inverted: across the five variants, cosine alignment is negatively correlated with accuracy (r = -0.94). To explain this, we introduce PRISM, a pair of inference-time diagnostics: a linear probe that asks where the answer is decodable, and a corruption test that asks whether the latent is load-bearing. The supervised latents are largely bypassed: corrupting them shifts accuracy by at most four percentage points. The answer is decodable downstream of the latent but not at the latent itself, and the size of this decodability gap predicts how much each variant relies on its latent under perturbation. Consistent with an Information Bottleneck reading of the loss, the auxiliary objective reshapes the language model through shared parameters rather than through the latent variable it nominally optimizes.
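The quantities named in the abstract can be made concrete with a small sketch. The Python below is illustrative only and is not the PRISM implementation: every function name is hypothetical, the probe is assumed to be a ridge-regularized least-squares classifier on one-hot labels, and corruption is assumed to mean shuffling latents across examples. `answer_fn` stands in for whatever maps latents to predicted answers in a given model.

```python
import numpy as np


def mean_cosine_alignment(latents, targets, eps=1e-8):
    """Mean row-wise cosine similarity between latents and their visual targets."""
    latents = np.asarray(latents, dtype=float)
    targets = np.asarray(targets, dtype=float)
    dots = np.sum(latents * targets, axis=1)
    norms = np.linalg.norm(latents, axis=1) * np.linalg.norm(targets, axis=1)
    return float(np.mean(dots / (norms + eps)))


def pearson_r(x, y):
    """Pearson correlation, e.g. between per-variant alignment and accuracy."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))


def _with_bias(features):
    """Append a constant column so the probe can learn an intercept."""
    features = np.asarray(features, dtype=float)
    return np.hstack([features, np.ones((len(features), 1))])


def fit_linear_probe(features, labels, num_classes, ridge=1e-3):
    """Assumed probe: ridge regression from hidden states to one-hot answers."""
    X = _with_bias(features)
    Y = np.eye(num_classes)[np.asarray(labels)]
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y)


def probe_accuracy(weights, features, labels):
    """Fraction of examples whose answer the probe decodes correctly."""
    preds = np.argmax(_with_bias(features) @ weights, axis=1)
    return float(np.mean(preds == np.asarray(labels)))


def corruption_shift(answer_fn, latents, labels, rng):
    """Accuracy change when latents are shuffled across examples (assumed corruption)."""
    latents = np.asarray(latents)
    labels = np.asarray(labels)
    clean = np.mean(answer_fn(latents) == labels)
    shuffled = latents[rng.permutation(len(latents))]
    corrupted = np.mean(answer_fn(shuffled) == labels)
    return float(clean - corrupted)
```

Under these assumptions, the decodability gap for a variant would be `probe_accuracy` on downstream hidden states minus `probe_accuracy` on the latent positions, with each probe fitted on a training split and scored on held-out examples. A `corruption_shift` near zero would indicate that the answer does not depend on the latent.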