Imagination Helps Visual Reasoning, But Not Yet in Latent Space
February 26, 2026
Authors: You Li, Chi Chen, Yanghao Li, Fanhu Zeng, Kaiyu Huang, Jinan Xu, Maosong Sun
cs.AI
Abstract
Latent visual reasoning aims to mimic the human imagination process by meditating through the hidden states of Multimodal Large Language Models. Although it is regarded as a promising paradigm for visual reasoning, the mechanisms behind its effectiveness remain unclear. To identify the true source of its efficacy, we investigate the validity of latent reasoning using Causal Mediation Analysis. We model the process as a causal chain: the input as the treatment, the latent tokens as the mediator, and the final answer as the outcome. Our findings uncover two critical disconnections: (a) Input-Latent Disconnect: dramatic perturbations to the input produce negligible changes in the latent tokens, suggesting that latent tokens do not effectively attend to the input sequence. (b) Latent-Answer Disconnect: perturbations to the latent tokens have minimal impact on the final answer, indicating that latent tokens exert only a limited causal effect on the outcome. Furthermore, extensive probing analysis reveals that latent tokens encode limited visual information and are highly similar to one another. Consequently, we challenge the necessity of latent reasoning and propose a straightforward alternative named CapImagine, which teaches the model to imagine explicitly using text. Experiments on vision-centric benchmarks show that CapImagine significantly outperforms complex latent-space baselines, highlighting the superior potential of visual reasoning through explicit imagination.
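The two perturbation tests in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the linear "model", the weights, the cosine-distance metric, and every name below (`latent_fn`, `answer_fn`, `input_latent_effect`, `latent_answer_effect`) are assumptions made for illustration. The toy model is deliberately built so that its latents barely depend on the input and its answer barely depends on the latents, which is the pattern the abstract reports.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy "model": latents are dominated by a constant bias,
# and the answer is dominated by the direct input pathway.
W = rng.normal(size=(8, 16))   # input -> latent weights (assumed)
A = rng.normal(size=(4, 16))   # input -> answer logits (assumed)
B = rng.normal(size=(4, 8))    # latent -> answer logits (assumed)
BIAS = np.ones(8)

def latent_fn(x):
    """Mediator: latent vector produced from input x."""
    return BIAS + 1e-3 * (W @ x)

def answer_fn(x, z):
    """Outcome: predicted class given input x and latent z."""
    return int(np.argmax(A @ x + 1e-3 * (B @ z)))

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def input_latent_effect(x, x_perturbed):
    """Treatment -> mediator: cosine distance between the latents
    produced by the original input and by a perturbed input."""
    return 1.0 - cosine_similarity(latent_fn(x), latent_fn(x_perturbed))

def latent_answer_effect(inputs, noise_scale=1.0):
    """Mediator -> outcome: fraction of inputs whose answer flips
    when Gaussian noise is added to the latent, input held fixed."""
    flipped = 0
    for x in inputs:
        z = latent_fn(x)
        z_noisy = z + noise_scale * rng.normal(size=z.shape)
        flipped += answer_fn(x, z) != answer_fn(x, z_noisy)
    return flipped / len(inputs)

inputs = [rng.normal(size=16) for _ in range(50)]
il = input_latent_effect(inputs[0], inputs[1])  # swap in an unrelated input
la = latent_answer_effect(inputs)
print(f"input -> latent effect (cosine distance): {il:.6f}")
print(f"latent -> answer effect (flip rate):      {la:.2f}")
```

Values near zero on both measures would correspond to the Input-Latent and Latent-Answer disconnects, respectively. In a real study the two functions would wrap a model's forward pass rather than fixed matrices.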