

Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs

October 28, 2025
作者: Huanyu Zhang, Wenshan Wu, Chengzu Li, Ning Shang, Yan Xia, Yangyu Huang, Yifan Zhang, Li Dong, Zhang Zhang, Liang Wang, Tieniu Tan, Furu Wei
cs.AI

Abstract

While Multimodal Large Language Models (MLLMs) excel at visual understanding, they often struggle in complex scenarios that require visual planning and imagination. Inspired by how humans use sketching as a form of visual thinking to develop and communicate ideas, we introduce Latent Sketchpad, a framework that equips MLLMs with an internal visual scratchpad. The internal visual representations of MLLMs have traditionally been confined to perceptual understanding; we repurpose them to support generative visual thought without compromising reasoning ability. Building on frontier MLLMs, our approach integrates visual generation directly into their native autoregressive reasoning process, allowing the model to interleave textual reasoning with the generation of visual latents. These latents guide the internal thought process and can be translated into sketch images for interpretability. To realize this, we introduce two components: a Context-Aware Vision Head that autoregressively produces visual representations, and a pretrained Sketch Decoder that renders these into human-interpretable images. We evaluate the framework on our new dataset, MazePlanning. Experiments across various MLLMs show that Latent Sketchpad delivers reasoning performance comparable or even superior to that of their backbones, and it generalizes across distinct frontier MLLMs, including Gemma3 and Qwen2.5-VL. By extending the model's textual reasoning to visual thinking, our framework opens new opportunities for richer human-computer interaction and broader applications. More details and resources are available on our project page: https://latent-sketchpad.github.io/.
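The interleaved decoding scheme the abstract describes can be sketched as a simple loop: after each textual reasoning step, a vision head autoregressively rolls out a burst of visual latents, and a sketch decoder renders those latents into an image. The sketch below is a minimal toy illustration of that control flow only; all class names, shapes, and the numpy stand-ins for the backbone MLLM, Context-Aware Vision Head, and Sketch Decoder are assumptions for illustration, not the authors' actual architecture or API.

```python
import numpy as np

rng = np.random.default_rng(0)

class VisionHead:
    """Toy stand-in for the Context-Aware Vision Head: one autoregressive
    step maps the current context vector to the next visual latent."""
    def __init__(self, dim=8):
        self.proj = rng.standard_normal((dim, dim)) * 0.1

    def step(self, context):
        # Next latent is a nonlinear function of the running context.
        return np.tanh(context @ self.proj)

class SketchDecoder:
    """Toy stand-in for the pretrained Sketch Decoder: renders a sequence
    of latents into a small 'sketch' image for interpretability."""
    def __init__(self, dim=8, hw=4):
        self.render = rng.standard_normal((dim, hw * hw)) * 0.1
        self.hw = hw

    def decode(self, latents):
        img = np.tanh(np.mean(latents, axis=0) @ self.render)
        return img.reshape(self.hw, self.hw)

def interleaved_reasoning(text_steps, n_latents=3, dim=8):
    """Alternate textual reasoning steps with bursts of visual latents,
    mimicking the interleaved text/visual-latent generation loop."""
    head, decoder = VisionHead(dim), SketchDecoder(dim)
    context = np.zeros(dim)
    trace = []
    for step in text_steps:
        trace.append(("text", step))
        # After each textual step, "think visually": autoregressively
        # roll out a few latents, carrying the context forward.
        latents = []
        for _ in range(n_latents):
            context = head.step(context + rng.standard_normal(dim) * 0.01)
            latents.append(context)
        # Decode the latent burst into a human-interpretable sketch.
        trace.append(("sketch", decoder.decode(np.stack(latents))))
    return trace

trace = interleaved_reasoning(["plan route", "check walls", "commit move"])
```

The resulting trace alternates `("text", …)` and `("sketch", …)` entries, which mirrors how the framework keeps the latents useful both as internal guidance and as renderable, inspectable images.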