Thinking with Generated Images

Abstract

We present Thinking with Generated Images, a novel paradigm that changes how large multimodal models (LMMs) engage in visual reasoning: it enables them to think natively across text and vision modalities by spontaneously generating intermediate visual thinking steps. Current visual reasoning with LMMs is limited to either processing fixed, user-provided images or reasoning solely through text-based chain-of-thought (CoT). Thinking with Generated Images opens a new dimension of cognitive capability in which models actively construct intermediate visual thoughts, critique their own visual hypotheses, and refine them as integral parts of the reasoning process. We demonstrate the effectiveness of the approach through two complementary mechanisms: (1) vision generation with intermediate visual subgoals, where models decompose a complex visual task into manageable components that are generated and integrated progressively, and (2) vision generation with self-critique, where models generate an initial visual hypothesis, analyze its shortcomings through textual reasoning, and produce a refined output based on their own critique. Experiments on vision generation benchmarks show substantial improvements over baseline approaches, with our models achieving a relative improvement of up to 50% (from 38% to 57%) on complex multi-object scenarios. Biochemists exploring novel protein structures, architects iterating on spatial designs, forensic analysts reconstructing crime scenes, and basketball players envisioning strategic plays all rely on visual imagination and iterative refinement; our approach enables AI models to engage in this kind of thinking, which characterizes human creative, analytical, and strategic reasoning. We release our open-source suite at https://github.com/GAIR-NLP/thinking-with-generated-images.
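The two mechanisms named in the abstract can be illustrated as control flow. The following is a minimal sketch only, not the authors' implementation: `Trace`, `generate_with_subgoals`, `generate_with_self_critique`, and the callables `decompose`, `draw`, `compose`, `critique`, and `refine` are hypothetical names introduced here, images are represented by placeholder strings, and the `max_rounds` loop and the "empty critique means stop" rule are assumptions rather than details stated in the abstract.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical placeholder: a real system would use pixel arrays or image tokens.
Image = str


@dataclass
class Trace:
    """Interleaved record of textual and visual reasoning steps."""
    steps: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, kind: str, content: str) -> None:
        self.steps.append((kind, content))


def generate_with_subgoals(
    prompt: str,
    decompose: Callable[[str], List[str]],
    draw: Callable[[str], Image],
    compose: Callable[[str, List[Image]], Image],
    trace: Trace,
) -> Image:
    """Mechanism (1): generate one image per subgoal, then integrate them."""
    parts: List[Image] = []
    for subgoal in decompose(prompt):
        trace.add("text", subgoal)   # textual subgoal
        part = draw(subgoal)
        trace.add("image", part)     # intermediate visual thought
        parts.append(part)
    final = compose(prompt, parts)
    trace.add("image", final)
    return final


def generate_with_self_critique(
    prompt: str,
    draw: Callable[[str], Image],
    critique: Callable[[str, Image], str],
    refine: Callable[[str, Image, str], Image],
    trace: Trace,
    max_rounds: int = 1,  # assumption: the abstract does not state a round count
) -> Image:
    """Mechanism (2): draft an image, critique it in text, then refine it."""
    image = draw(prompt)
    trace.add("image", image)        # initial visual hypothesis
    for _ in range(max_rounds):
        feedback = critique(prompt, image)
        if not feedback:             # assumption: empty critique means "done"
            break
        trace.add("text", feedback)  # textual analysis of shortcomings
        image = refine(prompt, image, feedback)
        trace.add("image", image)    # refined output
    return image
```

In this sketch, a single model would play every role (decomposing, drawing, critiquing, refining) within one interleaved text-and-image generation; the separate callables exist only to make the order of steps explicit.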