Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
June 20, 2025
作者: Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, Chuang Gan
cs.AI
Abstract
Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, limiting performance on tasks that demand visual imagination. Recent attempts train VLMs to render explicit images, but the heavy image-generation pre-training often hinders reasoning ability. Inspired by the way humans reason with mental imagery, the internal construction and manipulation of visual cues, we investigate whether VLMs can reason through interleaved multimodal trajectories without producing explicit images. To this end, we present a Machine Mental Imagery framework, dubbed Mirage, which augments VLM decoding with latent visual tokens alongside ordinary text. Concretely, whenever the model chooses to "think visually", it recasts its hidden states as next tokens, thereby continuing a multimodal trajectory without generating pixel-level images. We begin by supervising the latent tokens through distillation from ground-truth image embeddings, then switch to text-only supervision so that the latent trajectory aligns tightly with the task objective. A subsequent reinforcement learning stage further enhances the multimodal reasoning capability. Experiments on diverse benchmarks demonstrate that Mirage unlocks stronger multimodal reasoning without explicit image generation.
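
To make the decoding mechanism concrete, below is a minimal sketch, not the authors' released code, of how a causal VLM might interleave latent visual tokens with ordinary text during generation. It assumes a HuggingFace-style model interface; `latent_token_id` (the special token that triggers "visual thinking") and the `project` head (which recasts a hidden state into the input-embedding space) are hypothetical placeholders for the paper's latent-token machinery.

```python
import torch

@torch.no_grad()
def decode_with_latent_tokens(model, input_embeds, latent_token_id, project,
                              max_steps=64):
    """Autoregressive decoding that interleaves text and latent visual tokens.

    When the model samples the (assumed) special `latent_token_id`, the step's
    final hidden state is recast via `project` and fed back as the next input
    embedding, continuing the multimodal trajectory without ever generating a
    pixel-level image.
    """
    embeds = input_embeds  # (1, seq_len, hidden_dim) prompt embeddings
    text_ids = []
    for _ in range(max_steps):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        hidden = out.hidden_states[-1][:, -1]        # last-position hidden state
        next_id = out.logits[:, -1].argmax(dim=-1)   # greedy choice for brevity
        if next_id.item() == latent_token_id:
            # "Think visually": recast the hidden state as the next token
            # embedding instead of committing to a discrete text token.
            next_embed = project(hidden).unsqueeze(1)
        else:
            if next_id.item() == model.config.eos_token_id:
                break
            text_ids.append(next_id.item())
            next_embed = model.get_input_embeddings()(next_id).unsqueeze(1)
        embeds = torch.cat([embeds, next_embed], dim=1)
    return text_ids
```

In this reading, the latent branch never leaves the model's continuous hidden space, which is what lets the framework avoid the heavy image-generation pre-training the abstract identifies as harmful to reasoning; the exact sampling policy and projection design are left unspecified here and would follow the paper's training recipe.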