Thinking with Generated Images

Abstract

We present Thinking with Generated Images, a novel paradigm that fundamentally transforms how large multimodal models (LMMs) engage with visual reasoning: it enables them to think natively across text and vision modalities by spontaneously generating intermediate visual thinking steps. Current visual reasoning with LMMs is constrained to either processing fixed user-provided images or reasoning solely through text-based chain-of-thought (CoT). Thinking with Generated Images unlocks a new dimension of cognitive capability in which models actively construct intermediate visual thoughts, critique their own visual hypotheses, and refine them as integral components of their reasoning process. We demonstrate the effectiveness of our approach through two complementary mechanisms: (1) vision generation with intermediate visual subgoals, where models decompose complex visual tasks into manageable components that are generated and integrated progressively, and (2) vision generation with self-critique, where models generate an initial visual hypothesis, analyze its shortcomings through textual reasoning, and produce refined outputs based on their own critiques. Our experiments on vision generation benchmarks show substantial improvements over baseline approaches, with our models achieving a relative improvement of up to 50% (from 38% to 57%) in handling complex multi-object scenarios. Whether for biochemists exploring novel protein structures, architects iterating on spatial designs, forensic analysts reconstructing crime scenes, or basketball players envisioning strategic plays, our approach enables AI models to engage in the kind of visual imagination and iterative refinement that characterizes human creative, analytical, and strategic thinking. We release our open-source suite at https://github.com/GAIR-NLP/thinking-with-generated-images.
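
The two mechanisms named in the abstract can be illustrated with a minimal control-flow sketch. This is not the released implementation: every function name and signature below is hypothetical, strings stand in for images, and the toy stubs only mimic what a real LMM would do. The last helper simply checks the arithmetic behind the reported figure, since going from 38% to 57% is a relative gain of (57 - 38) / 38 = 50%.

```python
from typing import Callable, List

# All names here are illustrative assumptions, not the API of the released suite.
# A plain string stands in for a generated image.


def generate_with_subgoals(
    prompt: str,
    decompose: Callable[[str], List[str]],
    generate: Callable[[str], str],
    integrate: Callable[[str, List[str]], str],
) -> str:
    """Mechanism (1): split the task, generate each part, then integrate."""
    subgoals = decompose(prompt)
    partial_images = [generate(subgoal) for subgoal in subgoals]
    return integrate(prompt, partial_images)


def generate_with_self_critique(
    prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], str],
    refine: Callable[[str, str, str], str],
    max_rounds: int = 1,
) -> str:
    """Mechanism (2): draft an image, critique it in text, refine it."""
    image = generate(prompt)
    for _ in range(max_rounds):
        feedback = critique(prompt, image)
        if not feedback:  # an empty critique means nothing left to fix
            break
        image = refine(prompt, image, feedback)
    return image


def relative_improvement(before: float, after: float) -> float:
    """Relative gain of `after` over `before`, as a fraction."""
    return (after - before) / before


# Toy stand-ins for model calls (assumptions for demonstration only).
def toy_decompose(prompt: str) -> List[str]:
    return prompt.split(" and ")


def toy_generate(prompt: str) -> str:
    return "<img:" + prompt + ">"


def toy_integrate(prompt: str, parts: List[str]) -> str:
    return " + ".join(parts)


def toy_draft(prompt: str) -> str:
    # Deliberately flawed first attempt: only the first object is drawn.
    return "<img:" + prompt.split(" and ")[0] + ">"


def toy_critique(prompt: str, image: str) -> str:
    missing = [obj for obj in prompt.split(" and ") if obj not in image]
    return "missing: " + ", ".join(missing) if missing else ""


def toy_refine(prompt: str, image: str, feedback: str) -> str:
    return "<img:" + prompt + ">"


if __name__ == "__main__":
    print(generate_with_subgoals(
        "a dog and a cat", toy_decompose, toy_generate, toy_integrate))
    print(generate_with_self_critique(
        "a dog and a cat", toy_draft, toy_critique, toy_refine, max_rounds=2))
    print(relative_improvement(38, 57))
```

In a real system the callables would wrap model calls that emit interleaved text and image tokens; passing them in as parameters keeps the sketch independent of any particular model interface.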