

REASONEDIT: Towards Reasoning-Enhanced Image Editing Models

November 27, 2025
作者: Fukun Yin, Shiyu Liu, Yucheng Han, Zhibo Wang, Peng Xing, Rui Wang, Wei Cheng, Yingming Wang, Aojie Li, Zixin Yin, Pengtao Chen, Xiangyu Zhang, Daxin Jiang, Xianfang Zeng, Gang Yu
cs.AI

Abstract

Recent advances in image editing models have shown remarkable progress. A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder, as seen in systems such as Step1X-Edit and Qwen-Image-Edit, where the MLLM encodes both the reference image and the instruction but remains frozen during training. In this work, we demonstrate that unlocking the reasoning capabilities of the MLLM can further push the boundaries of editing models. Specifically, we explore two reasoning mechanisms, thinking and reflection, which enhance instruction understanding and editing accuracy. Building on these, our proposed framework enables image editing in a thinking-editing-reflection loop: the thinking mechanism leverages the world knowledge of the MLLM to interpret abstract instructions, while the reflection mechanism reviews editing results, automatically corrects unintended manipulations, and decides when to stop. Extensive experiments demonstrate that our reasoning approach achieves significant performance gains, with improvements on ImgEdit (+4.3%), GEdit (+4.7%), and Kris (+8.2%) when initializing our DiT from Step1X-Edit (ReasonEdit-S), and also outperforms previous open-source methods on both GEdit and Kris when integrated with Qwen-Image-Edit (ReasonEdit-Q).
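The thinking-editing-reflection loop described above can be sketched as pseudocode. This is a minimal illustrative sketch, not the paper's actual implementation: every name here (`mllm_think`, `dit_edit`, `mllm_reflect`, `Reflection`, the toy stopping rule) is a hypothetical placeholder standing in for the MLLM and DiT components.

```python
# Hypothetical sketch of the thinking-editing-reflection loop from the
# abstract. All functions are illustrative stand-ins, not the paper's API.
from dataclasses import dataclass


@dataclass
class Reflection:
    accepted: bool            # does the edit satisfy the instruction?
    revised_instruction: str  # correction to apply on the next round


def mllm_think(image: str, instruction: str) -> str:
    """Stand-in for the 'thinking' step: use the MLLM's world knowledge
    to resolve an abstract instruction into a concrete editing plan."""
    return f"concrete: {instruction}"


def dit_edit(image: str, concrete_instruction: str) -> str:
    """Stand-in for the diffusion (DiT) decoder performing the edit."""
    return f"{image} edited with [{concrete_instruction}]"


def mllm_reflect(reference: str, edited: str, instruction: str,
                 round_idx: int, max_rounds: int) -> Reflection:
    """Stand-in for the 'reflection' step: review the result, correct
    unintended manipulations, and decide whether to stop. Here we use a
    toy rule that accepts once the round budget is exhausted."""
    done = round_idx + 1 >= max_rounds
    return Reflection(accepted=done, revised_instruction=instruction)


def reason_edit(image: str, instruction: str, max_rounds: int = 3) -> str:
    """Run the thinking-editing-reflection loop until reflection accepts."""
    current = image
    for round_idx in range(max_rounds):
        concrete = mllm_think(current, instruction)            # thinking
        current = dit_edit(current, concrete)                  # editing
        verdict = mllm_reflect(image, current, instruction,
                               round_idx, max_rounds)          # reflection
        if verdict.accepted:
            break
        instruction = verdict.revised_instruction
    return current
```

In the real system each stand-in would invoke the MLLM or the diffusion decoder; the control flow shown (think, edit, reflect, possibly revise and repeat) is the part the abstract describes.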