Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
June 20, 2025
Authors: Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, Chuang Gan
cs.AI
Abstract
Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, limiting performance on tasks that demand visual imagination. Recent attempts train VLMs to render explicit images, but the heavy image-generation pre-training often degrades their reasoning ability. Inspired by the way humans reason with mental imagery, the internal construction and manipulation of visual cues, we investigate whether VLMs can reason through interleaved multimodal trajectories without producing explicit images. To this end, we present a Machine Mental Imagery framework, dubbed Mirage, which augments VLM decoding with latent visual tokens alongside ordinary text. Concretely, whenever the model chooses to "think visually", it recasts its hidden states as the next tokens, thereby continuing a multimodal trajectory without generating pixel-level images. We begin by supervising the latent tokens through distillation from ground-truth image embeddings, then switch to text-only supervision so that the latent trajectory aligns tightly with the task objective. A subsequent reinforcement learning stage further enhances the multimodal reasoning capability. Experiments on diverse benchmarks demonstrate that Mirage unlocks stronger multimodal reasoning without explicit image generation.
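The core decoding idea, recasting a hidden state as the next input token instead of sampling a text token, can be illustrated with a toy sketch. This is not the authors' implementation: the one-layer recurrence, the embedding table, and the linear gate that decides when to "think visually" are all hypothetical stand-ins for a real VLM.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 8, 16                                  # hidden size, toy vocab size
W_h = rng.normal(size=(D, D)) / np.sqrt(D)    # toy stand-in for a transformer
E = rng.normal(size=(V, D))                   # text-token embedding table
W_out = E                                     # tied output head (hypothetical)
GATE = rng.normal(size=D)                     # hypothetical "think visually" gate

def hidden(seq):
    """Map a sequence of input embeddings to a final hidden state."""
    h = np.zeros(D)
    for x in seq:
        h = np.tanh(W_h @ h + x)
    return h

def decode(prompt_ids, steps=6):
    """Interleaved decoding: each step emits either a latent visual token
    (the hidden state itself, fed back as the next input) or a text token."""
    seq = [E[i] for i in prompt_ids]
    trace = []
    for _ in range(steps):
        h = hidden(seq)
        if GATE @ h > 0.0:                    # model opts to "think visually"
            seq.append(h)                     # recast hidden state as next token
            trace.append("latent")
        else:
            tok = int(np.argmax(W_out @ h))   # ordinary text token
            seq.append(E[tok])
            trace.append(f"text:{tok}")
    return trace
```

The key design point mirrored here is that latent visual tokens never leave embedding space: no image decoder is invoked, so the multimodal trajectory continues at the cost of a single forward step per token.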