ChatPaper.ai


Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens

June 20, 2025
作者: Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, Chuang Gan
cs.AI

Abstract

Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, limiting performance on tasks that demand visual imagination. Recent attempts train VLMs to render explicit images, but the heavy image-generation pre-training often hinders reasoning ability. Inspired by the way humans reason with mental imagery, the internal construction and manipulation of visual cues, we investigate whether VLMs can reason through interleaved multimodal trajectories without producing explicit images. To this end, we present a Machine Mental Imagery framework, dubbed Mirage, which augments VLM decoding with latent visual tokens alongside ordinary text. Concretely, whenever the model chooses to "think visually", it recasts its hidden states as next tokens, thereby continuing a multimodal trajectory without generating pixel-level images. We first supervise the latent tokens through distillation from ground-truth image embeddings, then switch to text-only supervision so that the latent trajectory aligns tightly with the task objective. A subsequent reinforcement learning stage further enhances multimodal reasoning capability. Experiments on diverse benchmarks demonstrate that Mirage unlocks stronger multimodal reasoning without explicit image generation.
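The decoding mechanism described above can be sketched in a few lines. This is a toy illustration only, not the paper's implementation: the names (`ToyVLM`, `LATENT_TOKEN`, `decode`) are hypothetical, and the model is a scripted stub. The key idea shown is the control flow: when the model emits a special latent marker, its hidden state is fed back as the next input instead of a sampled text token, so the trajectory interleaves text and latent visual tokens without ever rendering pixels.

```python
# Minimal sketch of Mirage-style interleaved decoding (illustrative names only).
LATENT_TOKEN = "<latent>"  # hypothetical marker for "think visually"
END_TOKEN = "<eos>"

class ToyVLM:
    """Stand-in for a VLM decoder: each step returns (next_token, hidden_state)."""
    def __init__(self, script):
        self.script = list(script)

    def step(self, inp):
        # inp is either a text token (str) or a latent vector (list of floats)
        token = self.script.pop(0)
        hidden = [float(len(str(inp)))] * 4  # fake 4-dim hidden state
        return token, hidden

def decode(model, prompt, max_steps=10):
    """Interleaved decoding: text tokens are emitted as usual; when the model
    chooses to think visually, its hidden state is recast as the next token,
    continuing the multimodal trajectory without generating an image."""
    trajectory, inp = [], prompt
    for _ in range(max_steps):
        token, hidden = model.step(inp)
        if token == END_TOKEN:
            break
        if token == LATENT_TOKEN:
            trajectory.append(("latent", hidden))
            inp = hidden  # feed the hidden state back in place of a text token
        else:
            trajectory.append(("text", token))
            inp = token
    return trajectory

traj = decode(ToyVLM(["The", LATENT_TOKEN, "answer", END_TOKEN]), "Q: ...")
```

In a real VLM the latent token would be the decoder's continuous hidden state projected into the input-embedding space; the two-stage training (distillation from image embeddings, then text-only supervision) determines what those latent states encode.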