
Probing Visual Planning in Image Editing Models

April 23, 2026
Authors: Zhimu Zhou, Yanpeng Zhao, Qiuyu Liao, Bo Zhao, Xiaojian Ma
cs.AI

Abstract

Visual planning is a crucial facet of human intelligence, especially in tasks that require complex spatial reasoning and navigation. Yet in machine learning, this inherently visual problem is often tackled through a language-centric lens. While recent research demonstrates the promise of fully visual approaches, these approaches suffer from significant computational inefficiency due to their step-by-step planning-by-generation paradigm. In this work, we present EAR, an editing-as-reasoning paradigm that reformulates visual planning as a single-step image transformation. To isolate intrinsic reasoning from visual recognition, we employ abstract puzzles as probing tasks and introduce AMAZE, a procedurally generated dataset featuring the classical Maze and Queen problems, which cover distinct, complementary forms of visual planning. The abstract nature of AMAZE also enables automatic evaluation of autoregressive and diffusion-based models in terms of both pixel-wise fidelity and logical validity. We assess leading proprietary and open-source editing models and find that all of them struggle in the zero-shot setting, yet finetuning on basic scales enables remarkable generalization to larger in-domain scales and to out-of-domain scales and geometries. However, even our best model, running on high-end hardware, fails to match the zero-shot efficiency of human solvers, highlighting a persistent gap in neural visual reasoning.
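To make the evaluation protocol concrete, the following is a minimal Python sketch of how an AMAZE-style maze instance could be procedurally generated and how a candidate solution could be scored for logical validity and pixel-wise fidelity. All function names, the grid representation, and the exact-match fidelity metric are illustrative assumptions; the abstract does not specify the paper's actual implementation.

import random
from collections import deque

import numpy as np

def generate_maze(n, seed=0):
    # Carve an n x n maze with randomized depth-first search.
    # Passages are stored as frozenset({a, b}): cells a and b (row, col)
    # are connected. DFS yields a spanning tree, so any two cells are
    # reachable. (Illustrative; the paper's generator is not shown here.)
    rng = random.Random(seed)
    passages, visited, stack = set(), {(0, 0)}, [(0, 0)]
    while stack:
        r, c = stack[-1]
        nbrs = [(r + dr, c + dc)
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= r + dr < n and 0 <= c + dc < n
                and (r + dr, c + dc) not in visited]
        if nbrs:
            nxt = rng.choice(nbrs)
            passages.add(frozenset({(r, c), nxt}))
            visited.add(nxt)
            stack.append(nxt)
        else:
            stack.pop()
    return passages

def solve(passages, start, goal):
    # Breadth-first search for a shortest ground-truth path.
    prev, queue = {start: None}, deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = prev[cur]
            return path[::-1]
        r, c = cur
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if frozenset({cur, nxt}) in passages and nxt not in prev:
                prev[nxt] = cur
                queue.append(nxt)
    return None

def is_valid_solution(path, passages, start, goal):
    # Logical validity: the path starts at start, ends at goal, and every
    # step crosses an open passage (i.e., never walks through a wall).
    if not path or path[0] != start or path[-1] != goal:
        return False
    return all(frozenset({a, b}) in passages
               for a, b in zip(path, path[1:]))

def pixel_fidelity(pred, target):
    # Pixel-wise fidelity as the fraction of exactly matching pixels,
    # assuming H x W x C image arrays. (An assumed metric; the abstract
    # does not define the paper's fidelity measure.)
    pred, target = np.asarray(pred), np.asarray(target)
    return float((pred == target).all(axis=-1).mean())

# Example: generate an instance, derive a ground-truth path, verify it.
maze = generate_maze(5, seed=7)
path = solve(maze, (0, 0), (4, 4))
assert is_valid_solution(path, maze, (0, 0), (4, 4))

In the EAR setting, the model would instead emit the solved puzzle as a single edited image; a checker would parse the drawn path back into grid cells before applying the same validity test, while pixel_fidelity would compare the edited image against a rendered ground-truth solution.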