Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

June 13, 2024
Authors: Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, Ranjay Krishna
cs.AI

Abstract

Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry problems; we mark and circle when reasoning on maps; we use sketches to amplify our ideas and relieve our limited-capacity working memory. However, such actions are missing in current multimodal language models (LMs). Current chain-of-thought and tool-use paradigms only use text as intermediate reasoning steps. In this work, we introduce Sketchpad, a framework that gives multimodal LMs a visual sketchpad and tools to draw on the sketchpad. The LM conducts planning and reasoning according to the visual artifacts it has drawn. Unlike prior work, which uses text-to-image models to enable LMs to draw, Sketchpad enables LMs to draw with lines, boxes, marks, etc., which is closer to human sketching and better facilitates reasoning. Sketchpad can also use specialist vision models during the sketching process (e.g., draw bounding boxes with object detection models, draw masks with segmentation models) to further enhance visual perception and reasoning. We experiment with a wide range of math tasks (including geometry, functions, graphs, and chess) and complex visual reasoning tasks. Sketchpad substantially improves performance on all tasks over strong base models with no sketching, yielding an average gain of 12.7% on math tasks and 8.6% on vision tasks. GPT-4o with Sketchpad sets a new state of the art on all tasks, including V*Bench (80.3%), BLINK spatial reasoning (83.9%), and visual correspondence (80.8%). All code and data are available at https://visualsketchpad.github.io/.
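
To make the idea of drawing with lines, boxes, and marks concrete, here is a minimal illustrative sketch in Python. It is not the authors' implementation: the `Sketchpad` class, its method names, and the `solve_with_sketch` helper are all hypothetical, and the drawing steps are hard-coded where the framework described above would have a multimodal LM choose each step after looking at the current sketch. The canvas is rendered as an SVG string so the example needs only the standard library.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class Sketchpad:
    """Hypothetical canvas that accumulates simple drawing primitives."""
    width: int
    height: int
    shapes: List[str] = field(default_factory=list)

    def line(self, x1, y1, x2, y2, color="red"):
        # e.g. an auxiliary line in a geometry problem
        self.shapes.append(
            f'<line x1="{x1}" y1="{y1}" x2="{x2}" y2="{y2}" stroke="{color}"/>'
        )

    def box(self, x, y, w, h, color="blue"):
        # e.g. a bounding box that a detection model might propose
        self.shapes.append(
            f'<rect x="{x}" y="{y}" width="{w}" height="{h}" '
            f'fill="none" stroke="{color}"/>'
        )

    def mark(self, x, y, label):
        # a text label; no XML escaping is done, so keep labels plain
        self.shapes.append(f'<text x="{x}" y="{y}">{label}</text>')

    def render(self) -> str:
        body = "".join(self.shapes)
        return (
            f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{self.width}" height="{self.height}">{body}</svg>'
        )


def solve_with_sketch(pad: Sketchpad, steps: Sequence[Tuple[str, tuple]]) -> List[str]:
    """Apply each (tool_name, args) step and keep the sketch after every step.

    Assumption: in a real system each rendered frame would be passed back
    to the model, which would then pick the next step. Here the steps are fixed.
    """
    frames = []
    for tool, args in steps:
        getattr(pad, tool)(*args)
        frames.append(pad.render())
    return frames


pad = Sketchpad(200, 200)
frames = solve_with_sketch(pad, [
    ("line", (0, 0, 200, 200)),  # draw an auxiliary line
    ("box", (50, 50, 40, 30)),   # box a region of interest
    ("mark", (60, 45, "A")),     # label it
])
```

Each entry in `frames` is the full sketch after one more drawing action, which mirrors the idea of intermediate visual artifacts serving as reasoning steps alongside text.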
