Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs
October 28, 2025
Authors: Huanyu Zhang, Wenshan Wu, Chengzu Li, Ning Shang, Yan Xia, Yangyu Huang, Yifan Zhang, Li Dong, Zhang Zhang, Liang Wang, Tieniu Tan, Furu Wei
cs.AI
Abstract
While Multimodal Large Language Models (MLLMs) excel at visual understanding,
they often struggle in complex scenarios that require visual planning and
imagination. Inspired by how humans use sketching as a form of visual thinking
to develop and communicate ideas, we introduce Latent Sketchpad, a framework
that equips MLLMs with an internal visual scratchpad. The internal visual
representations of MLLMs have traditionally been confined to perceptual
understanding. We repurpose them to support generative visual thought without
compromising reasoning ability. Building on frontier MLLMs, our approach
integrates visual generation directly into their native autoregressive
reasoning process, allowing the model to interleave textual reasoning with the
generation of visual latents. These latents guide the internal thought process
and can be translated into sketch images for interpretability. To realize this,
we introduce two components: a Context-Aware Vision Head that autoregressively
produces visual representations, and a pretrained Sketch Decoder that renders them
into human-interpretable images. We evaluate the framework on our new dataset
MazePlanning. Experiments across various MLLMs show that Latent Sketchpad
delivers reasoning performance comparable or even superior to that of their backbones.
It further generalizes across distinct frontier MLLMs, including Gemma3 and
Qwen2.5-VL. By extending models' textual reasoning into visual thinking, our
framework opens new opportunities for richer human-computer interaction and
broader applications. More details and resources are available on our project
page: https://latent-sketchpad.github.io/.
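
To make the interleaved generation loop described above concrete, here is a minimal, hypothetical PyTorch sketch. All module names (ContextAwareVisionHead, SketchDecoder, the stand-in backbone and latent-to-token projection) and all shapes are illustrative assumptions for this sketch, not the authors' actual implementation or API.

```python
# Hypothetical sketch of interleaving text reasoning with visual latents.
# Not the paper's implementation; names and shapes are assumptions.
import torch
import torch.nn as nn

class ContextAwareVisionHead(nn.Module):
    """Predicts the next visual latent from the backbone's last hidden state."""
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_latent),
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden)

class SketchDecoder(nn.Module):
    """Renders a sequence of visual latents into a small RGB sketch image."""
    def __init__(self, d_latent: int, image_size: int = 32):
        super().__init__()
        self.image_size = image_size
        self.to_pixels = nn.Linear(d_latent, 3 * image_size * image_size)

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # Pool the latent sequence, then map to pixel space (toy decoder).
        pixels = self.to_pixels(latents.mean(dim=0))
        return pixels.view(3, self.image_size, self.image_size).sigmoid()

d_model, d_latent = 64, 32
vision_head = ContextAwareVisionHead(d_model, d_latent)
sketch_decoder = SketchDecoder(d_latent)

# Stand-in for the frozen MLLM backbone (here a single transformer layer).
backbone = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
# Projects emitted latents back into the token stream so they condition
# subsequent reasoning steps, as the abstract describes.
latent_to_token = nn.Linear(d_latent, d_model)

context = torch.randn(1, 5, d_model)  # embedded text tokens produced so far
latents = []
for _ in range(4):  # autoregressively emit a short run of visual latents
    hidden = backbone(context)[:, -1]            # (1, d_model)
    z = vision_head(hidden)                      # (1, d_latent)
    latents.append(z.squeeze(0))
    context = torch.cat([context, latent_to_token(z).unsqueeze(1)], dim=1)

sketch = sketch_decoder(torch.stack(latents))    # (3, 32, 32) viewable image
print(sketch.shape)
```

The key design point this toy loop mirrors is that the visual latents serve two roles at once: they are fed back into the autoregressive context to guide further reasoning, and they can separately be decoded into a human-interpretable sketch image.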