Cosine Misleads: The Auxiliary Loss Reshapes the Vision-Language Model, Not Its Latents

Abstract

Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vision-language models (VLMs). The field uses alignment between these latents and their visual targets, i.e., cosine similarity or mean squared error (MSE), as both the training loss and the quality metric, assuming that better alignment yields a better answer. We test this with a designed matrix of five LVR variants and find the assumption inverted: cosine alignment is negatively correlated with accuracy across all five (r=-0.94). To explain this, we introduce PRISM, a pair of inference-time diagnostics: a linear probe that asks where the answer is decodable, and a corruption test that asks whether the latent is load-bearing. The supervised latents are largely bypassed. Corrupting them shifts accuracy by at most four points. The answer is decodable downstream of the latent but not at it, and the size of this decodability gap predicts how much each variant relies on its latent under perturbation. Consistent with an Information Bottleneck reading of the loss, the auxiliary objective reshapes the language model via shared parameters rather than via the latent variable it nominally optimizes.
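The quantities named in the abstract can be illustrated with a minimal sketch. This is not the authors' PRISM implementation: the function names, the ridge-regression probe, the Gaussian-noise corruption, and the synthetic data are all assumptions chosen for illustration. It shows a cosine alignment score, a linear probe whose accuracy at two positions gives a decodability gap, and a corruption test that measures how much accuracy moves when the latents are replaced with noise.

```python
import numpy as np


def cosine_alignment(latents, targets):
    """Mean cosine similarity between row-wise paired latent and target vectors."""
    num = np.sum(latents * targets, axis=1)
    den = np.linalg.norm(latents, axis=1) * np.linalg.norm(targets, axis=1)
    return float(np.mean(num / np.maximum(den, 1e-12)))


def probe_accuracy(train_x, train_y, test_x, test_y, ridge=1e-3):
    """Fit a ridge-regularized linear probe on one-hot labels; return test accuracy.

    Assumption: a closed-form ridge probe stands in for whatever linear
    probe the paper actually trains.
    """
    n_classes = int(max(train_y.max(), test_y.max())) + 1
    onehot = np.eye(n_classes)[train_y]
    xb = np.hstack([train_x, np.ones((len(train_x), 1))])  # append bias column
    gram = xb.T @ xb + ridge * np.eye(xb.shape[1])
    weights = np.linalg.solve(gram, xb.T @ onehot)
    tb = np.hstack([test_x, np.ones((len(test_x), 1))])
    pred = np.argmax(tb @ weights, axis=1)
    return float(np.mean(pred == test_y))


def corruption_shift(predict, latents, labels, rng):
    """Accuracy change when latents are swapped for scale-matched Gaussian noise.

    `predict` is a hypothetical callable mapping latents to predicted labels.
    A shift near zero suggests the latents are bypassed.
    """
    clean = np.mean(predict(latents) == labels)
    noise = rng.standard_normal(latents.shape) * latents.std()
    corrupted = np.mean(predict(noise) == labels)
    return float(corrupted - clean)


def synthetic_demo(seed=0):
    """Toy data only: the answer is linearly present downstream, absent at the latent."""
    rng = np.random.default_rng(seed)
    n_train, n_test, dim = 400, 200, 8
    n = n_train + n_test
    y = rng.integers(0, 2, size=n)
    direction = np.ones(dim) / np.sqrt(dim)
    signal = 3.0 * (2 * y - 1)[:, None] * direction
    downstream = signal + rng.standard_normal((n, dim))
    latent = rng.standard_normal((n, dim))  # carries no label information
    acc_down = probe_accuracy(downstream[:n_train], y[:n_train],
                              downstream[n_train:], y[n_train:])
    acc_lat = probe_accuracy(latent[:n_train], y[:n_train],
                             latent[n_train:], y[n_train:])
    return acc_down, acc_lat


if __name__ == "__main__":
    acc_down, acc_lat = synthetic_demo()
    print("decodability gap:", acc_down - acc_lat)
```

In this toy setup the gap is large by construction, so it says nothing about real models. In practice the features would be hidden states extracted at the latent position and at a later position, and `predict` would wrap the full VLM forward pass.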