Thinking with Generated Images

Abstract

We present Thinking with Generated Images, a novel paradigm that fundamentally transforms how large multimodal models (LMMs) engage with visual reasoning by enabling them to natively think across text and vision modalities through spontaneous generation of intermediate visual thinking steps. Current visual reasoning with LMMs is constrained to either processing fixed user-provided images or reasoning solely through text-based chain-of-thought (CoT). Thinking with Generated Images unlocks a new dimension of cognitive capability in which models can actively construct intermediate visual thoughts, critique their own visual hypotheses, and refine them as integral components of their reasoning process. We demonstrate the effectiveness of our approach through two complementary mechanisms: (1) vision generation with intermediate visual subgoals, where models decompose complex visual tasks into manageable components that are generated and integrated progressively, and (2) vision generation with self-critique, where models generate an initial visual hypothesis, analyze its shortcomings through textual reasoning, and produce refined outputs based on their own critiques. Our experiments on vision generation benchmarks show substantial improvements over baseline approaches, with our models achieving a relative improvement of up to 50% (from 38% to 57%) in handling complex multi-object scenarios. From biochemists exploring novel protein structures and architects iterating on spatial designs to forensic analysts reconstructing crime scenes and basketball players envisioning strategic plays, our approach enables AI models to engage in the kind of visual imagination and iterative refinement that characterizes human creative, analytical, and strategic thinking. We release our open-source suite at https://github.com/GAIR-NLP/thinking-with-generated-images.
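
The self-critique mechanism described in the abstract (generate an initial visual hypothesis, critique it in text, then refine) can be pictured as a simple control loop. The sketch below is only an illustration of that loop and is not taken from the released suite: the names `Step`, `think_with_self_critique`, `toy_generate`, and `toy_critique` are hypothetical, the "image" is a plain string standing in for generated visual tokens, and the round limit is an assumed parameter.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One reasoning step: a generated image plus its textual critique."""
    image: str               # stand-in for a generated image (hypothetical)
    critique: Optional[str]  # None means the critique found nothing to fix


def think_with_self_critique(
    prompt: str,
    generate: Callable[[str, Optional[str]], str],
    critique: Callable[[str, str], Optional[str]],
    max_rounds: int = 3,  # assumed cap on refinement rounds
) -> List[Step]:
    """Alternate generation and critique until the critique is empty."""
    steps: List[Step] = []
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        image = generate(prompt, feedback)   # visual hypothesis
        feedback = critique(prompt, image)   # textual analysis of shortcomings
        steps.append(Step(image, feedback))
        if feedback is None:                 # nothing left to refine
            break
    return steps


# Toy stand-ins for a model (assumptions for illustration only).
def toy_generate(prompt: str, feedback: Optional[str]) -> str:
    objects = prompt.split(" and ")
    # The first attempt draws only the first object; a refinement draws all.
    return objects[0] if feedback is None else " and ".join(objects)


def toy_critique(prompt: str, image: str) -> Optional[str]:
    drawn = image.split(" and ")
    missing = [obj for obj in prompt.split(" and ") if obj not in drawn]
    return f"missing: {', '.join(missing)}" if missing else None


trace = think_with_self_critique("a cat and a dog", toy_generate, toy_critique)
for step in trace:
    print(step)
```

With the toy stand-ins, the first round omits one object, the critique names it, and the second round produces the complete scene, which mirrors the multi-object setting the abstract highlights. In a real system both callables would be calls into the same multimodal model.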