Cosine Misleads: Auxiliary Losses Reshape Vision-Language Models, Not Their Latents

Abstract

Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vision-language models (VLMs). The field uses alignment between these latents and their visual targets, i.e., cosine similarity or mean squared error (MSE), as both the training loss and the quality metric, assuming that better alignment yields a better answer. We test this assumption with a designed matrix of five LVR variants and find it inverted: cosine alignment is negatively correlated with accuracy across all five variants (r = -0.94). To explain this, we introduce PRISM, a pair of inference-time diagnostics: a linear probe that asks where the answer is decodable, and a corruption test that asks whether the latent is load-bearing. The supervised latents are largely bypassed. Corrupting them shifts accuracy by at most four percentage points. The answer is decodable downstream of the latent but not at the latent itself, and the size of this decodability gap predicts how much each variant relies on its latent under perturbation. Consistent with an Information Bottleneck reading of the loss, the auxiliary objective reshapes the language model via shared parameters rather than via the latent variable it nominally optimizes.
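To make the quantities in the abstract concrete, the following is a minimal, illustrative sketch of the two alignment metrics, the correlation statistic, and a corruption test. It is not the paper's PRISM implementation: every function name here is hypothetical, the corruption is assumed to be zero-ablation (the abstract does not specify the corruption used), and the predictors and examples are toy stand-ins rather than real VLMs or reported data. The linear probe is omitted.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]
# Hypothetical example format: (context features, latent vector, gold label).
Example = Tuple[Vector, Vector, int]
Predictor = Callable[[Vector, Vector], int]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between a latent and its visual target."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_squared_error(a: Vector, b: Vector) -> float:
    """Mean squared error between a latent and its visual target."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation, e.g. per-variant alignment vs. per-variant accuracy."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def zero_ablate(latent: Vector) -> List[float]:
    """Assumed corruption: replace the latent with zeros."""
    return [0.0 for _ in latent]


def accuracy(predict: Predictor, examples: Sequence[Example],
             corrupt: Optional[Callable[[Vector], Vector]] = None) -> float:
    """Fraction of correct predictions, optionally with corrupted latents."""
    correct = 0
    for context, latent, label in examples:
        if corrupt is not None:
            latent = corrupt(latent)
        correct += int(predict(context, latent) == label)
    return correct / len(examples)


def corruption_gap(predict: Predictor, examples: Sequence[Example]) -> float:
    """Clean accuracy minus accuracy under corrupted latents."""
    return accuracy(predict, examples) - accuracy(predict, examples, zero_ablate)


# Toy predictors (stand-ins for models, not real VLMs).
def bypass_predict(context: Vector, latent: Vector) -> int:
    return int(sum(context) > 0)   # ignores the latent entirely


def latent_predict(context: Vector, latent: Vector) -> int:
    return int(sum(latent) > 0)    # depends only on the latent


# Toy examples: the latent copies the context, so both predictors are
# perfect on clean inputs and differ only under corruption.
toy_examples: List[Example] = [
    ([1.0, 2.0], [1.0, 2.0], 1),
    ([0.5, 0.5], [0.5, 0.5], 1),
    ([-1.0, -2.0], [-1.0, -2.0], 0),
    ([-0.5, -0.5], [-0.5, -0.5], 0),
]
```

In this toy setup, a predictor that bypasses its latent shows no accuracy change under corruption, while one that reads the latent degrades. A small gap, as the abstract reports for the supervised latents, is the signature of the first case.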