Thinking with Generated Images

Abstract

We present Thinking with Generated Images, a novel paradigm that changes how large multimodal models (LMMs) engage with visual reasoning: models think natively across text and vision modalities by spontaneously generating intermediate visual thinking steps. Current visual reasoning with LMMs is constrained to either processing fixed user-provided images or reasoning solely through text-based chain-of-thought (CoT). Thinking with Generated Images unlocks a new dimension of cognitive capability in which models actively construct intermediate visual thoughts, critique their own visual hypotheses, and refine them as integral components of their reasoning process. We demonstrate the effectiveness of our approach through two complementary mechanisms: (1) vision generation with intermediate visual subgoals, where models decompose complex visual tasks into manageable components that are generated and integrated progressively, and (2) vision generation with self-critique, where models generate an initial visual hypothesis, analyze its shortcomings through textual reasoning, and produce refined outputs based on their own critiques. Our experiments on vision generation benchmarks show substantial improvements over baseline approaches, with our models achieving up to a 50% relative improvement (from 38% to 57%) in handling complex multi-object scenarios. From biochemists exploring novel protein structures and architects iterating on spatial designs to forensic analysts reconstructing crime scenes and basketball players envisioning strategic plays, our approach enables AI models to engage in the kind of visual imagination and iterative refinement that characterizes human creative, analytical, and strategic thinking. We release our open-source suite at https://github.com/GAIR-NLP/thinking-with-generated-images.
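
As a rough illustration of the control flow behind the two mechanisms described above, the following is a minimal sketch, not the released suite's implementation. Every name in it (`generate_with_subgoals`, `generate_with_self_critique`, the `toy_*` stand-ins, and the string-valued `Image` type) is a hypothetical placeholder; in the paradigm described here a single LMM would play all of these roles. The last helper only reproduces the arithmetic behind the reported figure, (57 - 38) / 38 = 0.5.

```python
from typing import Callable, List, Tuple

# Hypothetical placeholder: a real system would hold pixels or image tokens.
Image = str


def generate_with_subgoals(
    prompt: str,
    decompose: Callable[[str], List[str]],
    generate: Callable[[str], Image],
    integrate: Callable[[str, List[Image]], Image],
) -> Image:
    """Mechanism (1): generate one image per visual subgoal, then integrate."""
    parts = [generate(subgoal) for subgoal in decompose(prompt)]
    return integrate(prompt, parts)


def generate_with_self_critique(
    prompt: str,
    generate: Callable[[str], Image],
    critique: Callable[[str, Image], str],
    refine: Callable[[str, Image, str], Image],
    max_rounds: int = 2,  # assumed cap; the abstract does not specify one
) -> Tuple[Image, List[Image]]:
    """Mechanism (2): draft an image, critique it in text, refine from the critique."""
    image = generate(prompt)
    trace = [image]
    for _ in range(max_rounds):
        feedback = critique(prompt, image)
        if not feedback:  # an empty critique means nothing is left to fix
            break
        image = refine(prompt, image, feedback)
        trace.append(image)
    return image, trace


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline


# Toy stand-ins so the sketch runs without any model (all hypothetical).
def toy_decompose(prompt: str) -> List[str]:
    return [part.strip() for part in prompt.split(" and ")]


def toy_generate(prompt: str) -> Image:
    return f"<img:{prompt}>"


def toy_integrate(prompt: str, parts: List[Image]) -> Image:
    return "<img:" + "+".join(parts) + ">"


def toy_critique(prompt: str, image: Image) -> str:
    return "" if "refined" in image else "objects overlap"


def toy_refine(prompt: str, image: Image, feedback: str) -> Image:
    return image[:-1] + "|refined>"


if __name__ == "__main__":
    print(generate_with_subgoals(
        "a red cube and a blue sphere", toy_decompose, toy_generate, toy_integrate))
    print(generate_with_self_critique(
        "a cat", toy_generate, toy_critique, toy_refine))
    print(relative_improvement(38.0, 57.0))
```

Passing the model roles in as callables keeps the loop structure visible independently of any particular model; whether the released suite bounds the number of critique rounds is not stated in the abstract.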