Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

Abstract

Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry problems; we mark and circle when reasoning on maps; we use sketches to amplify our ideas and relieve our limited-capacity working memory. However, such actions are missing in current multimodal language models (LMs). Current chain-of-thought and tool-use paradigms use only text as intermediate reasoning steps. In this work, we introduce Sketchpad, a framework that gives multimodal LMs a visual sketchpad and tools to draw on it. The LM conducts planning and reasoning according to the visual artifacts it has drawn. Unlike prior work that uses text-to-image models to enable LMs to draw, Sketchpad enables LMs to draw with lines, boxes, marks, etc., which is closer to human sketching and better facilitates reasoning. Sketchpad can also use specialist vision models during the sketching process (e.g., drawing bounding boxes with object detection models or masks with segmentation models) to further enhance visual perception and reasoning. We experiment with a wide range of math tasks (including geometry, functions, graphs, and chess) and complex visual reasoning tasks. Sketchpad substantially improves performance on all tasks over strong base models with no sketching, yielding an average gain of 12.7% on math tasks and 8.6% on vision tasks. GPT-4o with Sketchpad sets a new state of the art on all tasks, including V*Bench (80.3%), BLINK spatial reasoning (83.9%), and visual correspondence (80.8%). All code and data are available at https://visualsketchpad.github.io/.
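
As a rough illustration of what "drawing with lines, boxes, and marks" on a sketchpad could look like, the following minimal sketch uses a plain character grid as the canvas. Everything here is an assumption made for illustration: the `Sketchpad` class, its method names, and the text-grid canvas are hypothetical and are not the framework's actual API, which operates on real images and can call specialist vision models.

```python
from dataclasses import dataclass, field


@dataclass
class Sketchpad:
    """Hypothetical text-grid canvas (illustration only, not the paper's code)."""
    width: int
    height: int
    grid: list = field(init=False)

    def __post_init__(self):
        # Start with an empty canvas of '.' cells.
        self.grid = [["." for _ in range(self.width)] for _ in range(self.height)]

    def draw_line(self, x0, y0, x1, y1, mark="*"):
        # Bresenham's line algorithm: marks every grid cell between the endpoints.
        dx, dy = abs(x1 - x0), -abs(y1 - y0)
        sx = 1 if x0 < x1 else -1
        sy = 1 if y0 < y1 else -1
        err = dx + dy
        while True:
            self.grid[y0][x0] = mark
            if x0 == x1 and y0 == y1:
                break
            e2 = 2 * err
            if e2 >= dy:
                err += dy
                x0 += sx
            if e2 <= dx:
                err += dx
                y0 += sy

    def draw_box(self, left, top, right, bottom, mark="#"):
        # Outline an axis-aligned box, like a bounding box around an object.
        for x in range(left, right + 1):
            self.grid[top][x] = mark
            self.grid[bottom][x] = mark
        for y in range(top, bottom + 1):
            self.grid[y][left] = mark
            self.grid[y][right] = mark

    def render(self):
        # Serialize the canvas; in a real system the updated image
        # would be passed back to the LM as the next observation.
        return "\n".join("".join(row) for row in self.grid)


pad = Sketchpad(width=6, height=4)
pad.draw_box(1, 1, 4, 3)                # box a region of interest
pad.draw_line(0, 0, 5, 0, mark="-")     # add an auxiliary line
print(pad.render())
```

The point of the sketch is only the loop it implies: a drawing action updates the canvas, and the updated canvas becomes input to the next reasoning step.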
