

REASONEDIT: Towards Reasoning-Enhanced Image Editing Models

November 27, 2025
Authors: Fukun Yin, Shiyu Liu, Yucheng Han, Zhibo Wang, Peng Xing, Rui Wang, Wei Cheng, Yingming Wang, Aojie Li, Zixin Yin, Pengtao Chen, Xiangyu Zhang, Daxin Jiang, Xianfang Zeng, Gang Yu
cs.AI

Abstract

Recent advances in image editing models have shown remarkable progress. A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder, as seen in systems such as Step1X-Edit and Qwen-Image-Edit, where the MLLM encodes both the reference image and the instruction but remains frozen during training. In this work, we demonstrate that unlocking the reasoning capabilities of the MLLM can further push the boundaries of editing models. Specifically, we explore two reasoning mechanisms, thinking and reflection, which enhance instruction understanding and editing accuracy. Building on this, our proposed framework enables image editing in a thinking-editing-reflection loop: the thinking mechanism leverages the world knowledge of the MLLM to interpret abstract instructions, while the reflection mechanism reviews editing results, automatically corrects unintended manipulations, and identifies the stopping round. Extensive experiments demonstrate that our reasoning approach achieves significant performance gains, with improvements on ImgEdit (+4.3%), GEdit (+4.7%), and Kris (+8.2%) when initializing our DiT from Step1X-Edit (ReasonEdit-S), and it also outperforms previous open-source methods on both GEdit and Kris when integrated with Qwen-Image-Edit (ReasonEdit-Q).
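To make the thinking-editing-reflection loop concrete, the sketch below shows how such a loop could be wired together in Python. It is an illustrative reading of the abstract only: the class and method names (MLLMReasoner, DiffusionEditor, think, reflect, edit) are hypothetical placeholders and do not reflect the authors' actual implementation or API.

```python
# Minimal sketch of a thinking-editing-reflection loop, as described in the abstract.
# All component names here are hypothetical placeholders, not the paper's code.

from dataclasses import dataclass


@dataclass
class ReflectionOutcome:
    image: object              # edited image for this round (e.g., tensor or PIL image)
    accepted: bool             # reflection verdict: is this the stopping round?
    revised_instruction: str   # corrected instruction if unintended edits were found


class MLLMReasoner:
    """Stand-in for the MLLM encoder with its reasoning unlocked."""

    def think(self, image, instruction: str) -> str:
        # Use world knowledge to turn an abstract instruction into a concrete plan
        # (placeholder: real systems would run MLLM inference here).
        return f"plan: {instruction}"

    def reflect(self, source, edited, instruction: str) -> ReflectionOutcome:
        # Review the edit against the instruction, flag unintended manipulations,
        # and decide whether to stop or retry with a corrected instruction.
        return ReflectionOutcome(image=edited, accepted=True, revised_instruction=instruction)


class DiffusionEditor:
    """Stand-in for the diffusion (DiT) decoder, e.g. initialized from Step1X-Edit."""

    def edit(self, image, plan: str):
        # Placeholder: return the image conditioned on the reasoning plan.
        return image


def reason_edit(image, instruction: str, max_rounds: int = 3):
    """Run think -> edit -> reflect until reflection accepts or rounds run out."""
    reasoner, editor = MLLMReasoner(), DiffusionEditor()
    current = image
    for _ in range(max_rounds):
        plan = reasoner.think(current, instruction)               # thinking
        edited = editor.edit(current, plan)                       # editing
        outcome = reasoner.reflect(image, edited, instruction)    # reflection
        current = outcome.image
        if outcome.accepted:                                      # stopping round reached
            break
        instruction = outcome.revised_instruction                 # auto-correct and retry
    return current
```

The key design point the loop illustrates is that the MLLM participates twice per round, before the edit (to ground the instruction) and after it (to verify and correct), rather than acting only as a frozen encoder.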