Cosine Misleads: The Auxiliary Loss Reshapes the Vision-Language Model, Not Its Latents

Abstract

Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vision-language models (VLMs). The field uses alignment between these latents and their visual targets, i.e., cosine similarity or mean squared error (MSE), as both the training loss and the quality metric, assuming that better alignment yields better answers. We test this with a designed matrix of five LVR variants and find the assumption inverted: cosine alignment is negatively correlated with accuracy across all five (r = -0.94). To explain this, we introduce PRISM, a pair of inference-time diagnostics: a linear probe that asks where the answer is decodable, and a corruption test that asks whether the latent is load-bearing. The supervised latents are largely bypassed. Corrupting them shifts accuracy by at most four percentage points. The answer is decodable downstream of the latent but not at it, and the size of this decodability gap predicts how much each variant relies on its latent under perturbation. Consistent with an Information Bottleneck reading of the loss, the auxiliary objective reshapes the language model via shared parameters rather than via the latent variable it nominally optimizes.
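To make the two diagnostics concrete, the following is a minimal sketch on synthetic data, not the PRISM implementation. The function names (`cosine_alignment`, `probe_accuracy`, `corruption_shift`), the ridge-regression probe, the norm-matched Gaussian corruption, and the toy `answer_fn` are all illustrative assumptions. The toy setup is built so that the latent is well aligned with its target yet carries no answer information, which is the bypass pattern described above.

```python
import numpy as np

def cosine_alignment(latents, targets):
    """Mean row-wise cosine similarity between latents and their targets."""
    num = np.sum(latents * targets, axis=1)
    den = np.linalg.norm(latents, axis=1) * np.linalg.norm(targets, axis=1)
    return float(np.mean(num / np.maximum(den, 1e-12)))

def probe_accuracy(features, labels, n_classes, ridge=1e-3, train_frac=0.5):
    """Fit a ridge-regularized linear probe on one-hot labels using the first
    part of the data; return accuracy on the held-out remainder."""
    n = features.shape[0]
    split = int(n * train_frac)
    X = np.hstack([features, np.ones((n, 1))])   # append a bias column
    Y = np.eye(n_classes)[labels]                # one-hot targets
    Xtr, Ytr = X[:split], Y[:split]
    W = np.linalg.solve(Xtr.T @ Xtr + ridge * np.eye(X.shape[1]), Xtr.T @ Ytr)
    pred = np.argmax(X[split:] @ W, axis=1)
    return float(np.mean(pred == labels[split:]))

def corruption_shift(answer_fn, latents, downstream, labels, rng):
    """Accuracy change when latents are replaced by norm-matched Gaussian noise."""
    clean = np.mean(answer_fn(latents, downstream) == labels)
    noise = rng.standard_normal(latents.shape)
    noise *= (np.linalg.norm(latents, axis=1, keepdims=True)
              / np.linalg.norm(noise, axis=1, keepdims=True))
    corrupted = np.mean(answer_fn(noise, downstream) == labels)
    return float(corrupted - clean)

# --- Toy data (hypothetical; stands in for hidden states of a real VLM) ---
rng = np.random.default_rng(0)
n, d, n_classes = 400, 16, 4
labels = rng.integers(0, n_classes, size=n)
class_means = 3.0 * rng.standard_normal((n_classes, d))

# Downstream states carry the answer; latents track their targets but not the answer.
downstream = class_means[labels] + rng.standard_normal((n, d))
targets = rng.standard_normal((n, d))
latents = targets + 0.1 * rng.standard_normal((n, d))

def answer_fn(lat, down):
    """Toy 'model' that reads only the downstream state, so the latent is bypassed."""
    dists = ((down[:, None, :] - class_means[None, :, :]) ** 2).sum(axis=-1)
    return np.argmin(dists, axis=1)

align = cosine_alignment(latents, targets)
acc_latent = probe_accuracy(latents, labels, n_classes)
acc_down = probe_accuracy(downstream, labels, n_classes)
shift = corruption_shift(answer_fn, latents, downstream, labels, rng)

print(f"cosine alignment:        {align:.3f}")
print(f"probe acc at latent:     {acc_latent:.3f}")
print(f"probe acc downstream:    {acc_down:.3f}")
print(f"decodability gap:        {acc_down - acc_latent:.3f}")
print(f"accuracy shift (corrupt): {shift:.3f}")
```

In this toy case the alignment is high, the probe is near chance at the latent but accurate downstream, and corrupting the latent leaves accuracy unchanged because `answer_fn` never reads it. On a real model, the probe would be fit on hidden states at the latent positions and at later positions, and the corruption would be applied to the latent tokens during a forward pass.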