REASONEDIT: Towards Reasoning-Enhanced Image Editing Models

Abstract

Image editing models have progressed rapidly in recent years. A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder, as seen in systems such as Step1X-Edit and Qwen-Image-Edit, where the MLLM encodes both the reference image and the instruction but remains frozen during training. In this work, we demonstrate that unlocking the reasoning capabilities of the MLLM can further push the boundaries of editing models. Specifically, we explore two reasoning mechanisms, thinking and reflection, which enhance instruction understanding and editing accuracy. Building on these mechanisms, our proposed framework performs image editing in a thinking-editing-reflection loop: the thinking mechanism leverages the world knowledge of the MLLM to interpret abstract instructions, while the reflection mechanism reviews editing results, automatically corrects unintended manipulations, and identifies the round at which to stop. Extensive experiments demonstrate that our reasoning approach achieves significant performance gains, with improvements on ImgEdit (+4.3%), GEdit (+4.7%), and Kris (+8.2%) when our DiT is initialized from Step1X-Edit (ReasonEdit-S), and that it also outperforms previous open-source methods on both GEdit and Kris when integrated with Qwen-Image-Edit (ReasonEdit-Q).
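The thinking-editing-reflection loop described in the abstract can be illustrated with a minimal control-flow sketch. This is not the authors' implementation: the names `reason_edit_loop`, `Reflection`, `think`, `edit`, `reflect`, and `max_rounds` are all hypothetical, and the choice to apply each correction to the latest edited image is an assumption, since the abstract does not specify it. The toy stand-ins operate on strings rather than real images so that the loop can be run as-is.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class Reflection:
    """Verdict of the (hypothetical) reflection step."""
    done: bool                        # True when the edit is judged acceptable
    correction: Optional[str] = None  # follow-up instruction when not done


def reason_edit_loop(
    image: Any,
    instruction: str,
    think: Callable[[Any, str], str],
    edit: Callable[[Any, str], Any],
    reflect: Callable[[Any, Any, str], Reflection],
    max_rounds: int = 3,
) -> Tuple[Any, int]:
    """Run think -> (edit -> reflect)* and return (result, rounds used)."""
    # Thinking: turn a possibly abstract instruction into a concrete one.
    concrete = think(image, instruction)
    current = image
    step_instruction = concrete
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        # Editing: apply the current instruction.
        # Assumption: corrections are applied to the latest result.
        current = edit(current, step_instruction)
        # Reflection: compare source and result, decide whether to stop.
        verdict = reflect(image, current, concrete)
        if verdict.done:
            break
        step_instruction = verdict.correction or concrete
    return current, rounds


# Toy stand-ins (strings instead of images) to exercise the control flow.
def toy_think(image: str, instruction: str) -> str:
    return "add a red hat" if instruction == "make it festive" else instruction


def toy_edit(image: str, instruction: str) -> str:
    return f"{image} + [{instruction}]"


def toy_reflect(source: str, edited: str, instruction: str) -> Reflection:
    if "[remove the extra scarf]" in edited:
        return Reflection(done=True)
    return Reflection(done=False, correction="remove the extra scarf")


result, rounds = reason_edit_loop(
    "cat.png", "make it festive", toy_think, toy_edit, toy_reflect
)
print(result, rounds)
```

In a real system the three callables would presumably wrap the MLLM (for thinking and reflection) and the diffusion decoder (for editing); passing them in as parameters keeps the loop itself independent of any particular model.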
