SketchVLM: Vision language models can annotate images to explain thoughts and guide users
April 23, 2026
Authors: Brandon Collins, Logan Bolton, Hung Huy Nguyen, Mohammad Reza Taesiri, Trung Bui, Anh Totti Nguyen
cs.AI
Abstract
When answering questions about images, humans naturally point, label, and draw to explain their reasoning. In contrast, modern vision-language models (VLMs) such as Gemini-3-Pro and GPT-5 only respond with text, which can be difficult for users to verify. We present SketchVLM, a training-free, model-agnostic framework that enables VLMs to produce non-destructive, editable SVG overlays on the input image to visually explain their answers. Across seven benchmarks spanning visual reasoning (maze navigation, ball-drop trajectory prediction, and object counting) and drawing (part labeling, connecting-the-dots, and drawing shapes around objects), SketchVLM improves visual reasoning task accuracy by up to +28.5 percentage points and annotation quality by up to 1.48x relative to image-editing and fine-tuned sketching baselines, while also producing annotations that are more faithful to the model's stated answer. We find that single-turn generation already achieves strong accuracy and annotation quality, and multi-turn generation opens up further opportunities for human-AI collaboration. An interactive demo and code are at https://sketchvlm.github.io/.
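To make the idea of a non-destructive, editable SVG overlay concrete, here is a minimal Python sketch. It is not the authors' implementation: the function name `build_overlay_svg`, the file name `input.png`, the image size, and the example annotation strings are all assumptions made for illustration. The sketch only shows one plausible way to layer model-produced SVG elements over an image that is referenced rather than modified.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

SVG_NS = "http://www.w3.org/2000/svg"


def build_overlay_svg(image_href, width, height, annotations):
    """Return an SVG document that layers annotations over a referenced image.

    Hypothetical helper, not part of SketchVLM. The image file is only
    referenced, never rewritten, so the overlay is non-destructive, and each
    annotation stays a separate element that can be edited or removed later.
    """
    layer = "\n    ".join(annotations)
    svg = (
        f'<svg xmlns="{SVG_NS}" width="{width}" height="{height}" '
        f'viewBox="0 0 {width} {height}">\n'
        f'  <image href={quoteattr(image_href)} width="{width}" height="{height}"/>\n'
        f'  <g id="annotations">\n    {layer}\n  </g>\n'
        f'</svg>'
    )
    # Reject malformed markup early: fromstring raises ParseError on bad XML.
    ET.fromstring(svg)
    return svg


# Assumed example of model output: a circle around one object plus a label.
example_annotations = [
    '<circle cx="120" cy="80" r="30" fill="none" stroke="red" stroke-width="3"/>',
    '<text x="160" y="85" font-size="16" fill="red">object 1</text>',
]
overlay = build_overlay_svg("input.png", 640, 480, example_annotations)
```

Because the annotations live in their own `<g>` group, a user or a later model turn could delete or adjust a single element without touching the underlying pixels, which is the property the abstract contrasts with image-editing baselines.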