Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

Abstract

Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry problems, we mark and circle locations when reasoning over maps, and we use sketches to amplify our ideas and relieve our limited-capacity working memory. Such actions are missing from current multimodal language models (LMs): current chain-of-thought and tool-use paradigms use only text as intermediate reasoning steps. In this work, we introduce Sketchpad, a framework that gives multimodal LMs a visual sketchpad and tools to draw on it. The LM plans and reasons according to the visual artifacts it has drawn. Unlike prior work, which uses text-to-image models to let LMs draw, Sketchpad lets LMs draw with lines, boxes, marks, and similar primitives, which is closer to human sketching and better facilitates reasoning. Sketchpad can also use specialist vision models during the sketching process (e.g., drawing bounding boxes with object detection models or masks with segmentation models) to further enhance visual perception and reasoning. We experiment with a wide range of math tasks (including geometry, functions, graphs, and chess) and complex visual reasoning tasks. Sketchpad substantially improves performance on all tasks over strong base models with no sketching, yielding an average gain of 12.7% on math tasks and 8.6% on vision tasks. GPT-4o with Sketchpad sets a new state of the art on all tasks, including V*Bench (80.3%), BLINK spatial reasoning (83.9%), and visual correspondence (80.8%). All code and data are available at https://visualsketchpad.github.io/.
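To make the idea of drawing with simple primitives concrete, here is a minimal illustrative sketch. It is not the authors' implementation: the `Sketchpad` class below, its `draw_line`, `draw_box`, and `render` methods, and the character-grid canvas are all hypothetical stand-ins for an image canvas and drawing tools. It only shows the kind of intermediate visual artifact (an auxiliary line, a bounding box) that a model could produce and then inspect before continuing its reasoning.

```python
from dataclasses import dataclass, field


@dataclass
class Sketchpad:
    """Hypothetical character-grid canvas standing in for an image an LM can draw on."""
    width: int
    height: int
    grid: list = field(init=False)

    def __post_init__(self):
        # Start with an empty canvas; "." marks an untouched cell.
        self.grid = [["." for _ in range(self.width)] for _ in range(self.height)]

    def draw_line(self, x0, y0, x1, y1, mark="*"):
        """Draw a straight line with Bresenham's algorithm (e.g. an auxiliary line)."""
        dx, dy = abs(x1 - x0), -abs(y1 - y0)
        sx = 1 if x0 < x1 else -1
        sy = 1 if y0 < y1 else -1
        err = dx + dy
        while True:
            self.grid[y0][x0] = mark
            if x0 == x1 and y0 == y1:
                break
            e2 = 2 * err
            if e2 >= dy:
                err += dy
                x0 += sx
            if e2 <= dx:
                err += dx
                y0 += sy

    def draw_box(self, left, top, right, bottom, mark="#"):
        """Draw a rectangle outline (e.g. a bounding box returned by a detector)."""
        for x in range(left, right + 1):
            self.grid[top][x] = mark
            self.grid[bottom][x] = mark
        for y in range(top, bottom + 1):
            self.grid[y][left] = mark
            self.grid[y][right] = mark

    def render(self):
        """Return the canvas as text, i.e. the artifact handed back to the model."""
        return "\n".join("".join(row) for row in self.grid)


# Draw a box, then an auxiliary diagonal across it, and show the result.
pad = Sketchpad(width=8, height=5)
pad.draw_box(0, 0, 4, 4)
pad.draw_line(0, 0, 4, 4)
print(pad.render())
```

In a real system the rendered canvas would be an image returned to the multimodal LM as a new observation; the text grid here only keeps the example dependency-free.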
