Imagination Helps Visual Reasoning, But Not Yet in Latent Space

Abstract

Latent visual reasoning aims to mimic the human imagination process by meditating through the hidden states of Multimodal Large Language Models. Although it is recognized as a promising paradigm for visual reasoning, the mechanisms behind its effectiveness remain unclear. To identify the true source of its efficacy, we examine the validity of latent reasoning using Causal Mediation Analysis. We model the process as a causal chain: the input is the treatment, the latent tokens are the mediator, and the final answer is the outcome. Our findings uncover two critical disconnections: (a) Input-Latent Disconnect: dramatic perturbations of the input produce negligible changes in the latent tokens, suggesting that the latent tokens do not effectively attend to the input sequence. (b) Latent-Answer Disconnect: perturbations of the latent tokens have minimal impact on the final answer, indicating that the latent tokens exert only a limited causal effect on the outcome. Furthermore, extensive probing analysis reveals that latent tokens encode limited visual information and are highly similar to one another. Consequently, we challenge the necessity of latent reasoning and propose a straightforward alternative named CapImagine, which teaches the model to imagine explicitly using text. Experiments on vision-centric benchmarks show that CapImagine significantly outperforms complex latent-space baselines, highlighting the superior potential of visual reasoning through explicit imagination.
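The two disconnects described above can be illustrated with a toy sketch. This is not the procedure used in this work: the metrics (cosine similarity between latents, and the fraction of answers that flip), the function names, and the two-dimensional `toy_encode` / `toy_decode` model are all hypothetical stand-ins chosen only to make the causal chain input → latent → answer concrete.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def input_latent_similarity(
    encode: Callable[[Vector], Vector], x: Vector, x_perturbed: Vector
) -> float:
    """Treatment -> mediator: how similar are the latents of an input
    and its perturbed version? A value near 1 suggests a disconnect."""
    return cosine_similarity(encode(x), encode(x_perturbed))


def latent_answer_flip_rate(
    encode: Callable[[Vector], Vector],
    decode: Callable[[Vector, Vector], str],
    inputs: List[Vector],
    perturb_latent: Callable[[Vector], Vector],
) -> float:
    """Mediator -> outcome: fraction of inputs whose answer changes when
    only the latent is perturbed. A value near 0 suggests a disconnect."""
    flips = 0
    for x in inputs:
        z = encode(x)
        if decode(x, z) != decode(x, perturb_latent(z)):
            flips += 1
    return flips / len(inputs)


# Hypothetical toy model (an assumption for illustration only).
def toy_encode(x: Vector) -> List[float]:
    # The latent is dominated by a fixed offset; the input barely matters.
    return [1.0 + 0.001 * x[0], 1.0 + 0.001 * x[1]]


def toy_decode(x: Vector, z: Vector) -> str:
    # The answer depends on the input directly and only weakly on the latent.
    score = (x[0] - x[1]) + 0.001 * (z[0] - z[1])
    return "A" if score > 0 else "B"


if __name__ == "__main__":
    sim = input_latent_similarity(toy_encode, [5.0, -3.0], [-5.0, 3.0])
    flip = latent_answer_flip_rate(
        toy_encode,
        toy_decode,
        [[5.0, -3.0], [-2.0, 4.0], [1.0, 0.5]],
        lambda z: [-v for v in z],  # drastic latent perturbation: sign flip
    )
    print(f"latent similarity under input perturbation: {sim:.4f}")
    print(f"answer flip rate under latent perturbation: {flip:.2f}")
```

In this toy setting the latent stays almost unchanged when the input is negated, and the answers do not change when the latent is negated, which is the qualitative pattern the two disconnects describe. With a real model, `encode` and `decode` would have to be replaced by hooks into the model's hidden states, which this sketch does not attempt.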