Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
June 20, 2025
作者: Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, Chuang Gan
cs.AI
Abstract
Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, limiting performance on tasks that demand visual imagination. Recent attempts train VLMs to render explicit images, but the heavy image-generation pre-training often hinders reasoning ability. Inspired by the way humans reason with mental imagery, the internal construction and manipulation of visual cues, we investigate whether VLMs can reason through interleaved multimodal trajectories without producing explicit images. To this end, we present a Machine Mental Imagery framework, dubbed Mirage, which augments VLM decoding with latent visual tokens alongside ordinary text. Concretely, whenever the model chooses to "think visually", it recasts its hidden states as next tokens, thereby continuing a multimodal trajectory without generating pixel-level images. We begin by supervising the latent tokens through distillation from ground-truth image embeddings, then switch to text-only supervision so that the latent trajectory aligns tightly with the task objective. A subsequent reinforcement learning stage further enhances the multimodal reasoning capability. Experiments on diverse benchmarks demonstrate that Mirage unlocks stronger multimodal reasoning without explicit image generation.
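
To make the decoding mechanism concrete, below is a minimal sketch, not the authors' released code, of how a causal VLM might interleave latent visual tokens with ordinary text during generation. It assumes a HuggingFace-style model interface; `latent_token_id` (the special token that triggers "visual thinking") and the `project` head (which recasts a hidden state into the input-embedding space) are hypothetical placeholders for the paper's latent-token machinery.

```python
import torch

@torch.no_grad()
def decode_with_latent_tokens(model, input_embeds, latent_token_id, project,
                              max_steps=64):
    """Autoregressive decoding that interleaves text and latent visual tokens.

    When the model samples the (assumed) special `latent_token_id`, the step's
    final hidden state is recast via `project` and fed back as the next input
    embedding, continuing the multimodal trajectory without ever generating a
    pixel-level image.
    """
    embeds = input_embeds  # (1, seq_len, hidden_dim) prompt embeddings
    text_ids = []
    for _ in range(max_steps):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        hidden = out.hidden_states[-1][:, -1]        # last-position hidden state
        next_id = out.logits[:, -1].argmax(dim=-1)   # greedy choice for brevity
        if next_id.item() == latent_token_id:
            # "Think visually": recast the hidden state as the next token
            # embedding instead of committing to a discrete text token.
            next_embed = project(hidden).unsqueeze(1)
        else:
            if next_id.item() == model.config.eos_token_id:
                break
            text_ids.append(next_id.item())
            next_embed = model.get_input_embeddings()(next_id).unsqueeze(1)
        embeds = torch.cat([embeds, next_embed], dim=1)
    return text_ids
```

In this reading, the latent branch never leaves the model's continuous hidden space, which is what lets the framework avoid the heavy image-generation pre-training the abstract identifies as harmful to reasoning; the exact sampling policy and projection design are left unspecified here and would follow the paper's training recipe.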