Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
June 20, 2025
Authors: Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, Chuang Gan
cs.AI
Abstract
Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, limiting performance on tasks that demand visual imagination. Recent attempts train VLMs to render explicit images, but the heavy image-generation pre-training often degrades their reasoning ability. Inspired by the way humans reason with mental imagery, the internal construction and manipulation of visual cues, we investigate whether VLMs can reason through interleaved multimodal trajectories without producing explicit images. To this end, we present a Machine Mental Imagery framework, dubbed Mirage, which augments VLM decoding with latent visual tokens alongside ordinary text. Concretely, whenever the model chooses to "think visually", it recasts its hidden states as the next tokens, thereby continuing a multimodal trajectory without generating pixel-level images. We begin by supervising the latent tokens through distillation from ground-truth image embeddings, then switch to text-only supervision so that the latent trajectory aligns tightly with the task objective. A subsequent reinforcement learning stage further enhances the multimodal reasoning capability. Experiments on diverse benchmarks demonstrate that Mirage unlocks stronger multimodal reasoning without explicit image generation.
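The core decoding idea, recasting a hidden state as the next input token instead of sampling a text token, can be illustrated with a toy sketch. This is not the authors' implementation: the one-layer recurrence, the embedding table, and the linear gate that decides when to "think visually" are all hypothetical stand-ins for a real VLM.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 8, 16                                  # hidden size, toy vocab size
W_h = rng.normal(size=(D, D)) / np.sqrt(D)    # toy stand-in for a transformer
E = rng.normal(size=(V, D))                   # text-token embedding table
W_out = E                                     # tied output head (hypothetical)
GATE = rng.normal(size=D)                     # hypothetical "think visually" gate

def hidden(seq):
    """Map a sequence of input embeddings to a final hidden state."""
    h = np.zeros(D)
    for x in seq:
        h = np.tanh(W_h @ h + x)
    return h

def decode(prompt_ids, steps=6):
    """Interleaved decoding: each step emits either a latent visual token
    (the hidden state itself, fed back as the next input) or a text token."""
    seq = [E[i] for i in prompt_ids]
    trace = []
    for _ in range(steps):
        h = hidden(seq)
        if GATE @ h > 0.0:                    # model opts to "think visually"
            seq.append(h)                     # recast hidden state as next token
            trace.append("latent")
        else:
            tok = int(np.argmax(W_out @ h))   # ordinary text token
            seq.append(E[tok])
            trace.append(f"text:{tok}")
    return trace
```

The key design point mirrored here is that latent visual tokens never leave embedding space: no image decoder is invoked, so the multimodal trajectory continues at the cost of a single forward step per token.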