MIRA: Multimodal Iterative Reasoning Agent for Image Editing
November 26, 2025
Authors: Ziyun Zeng, Hang Hua, Jiebo Luo
cs.AI
Abstract
Instruction-guided image editing offers an intuitive way for users to edit images with natural language. However, diffusion-based editing models often struggle to accurately interpret complex user instructions, especially those involving compositional relationships, contextual cues, or referring expressions, leading to edits that drift semantically or fail to reflect the intended changes. We tackle this problem with MIRA (Multimodal Iterative Reasoning Agent), a lightweight, plug-and-play multimodal reasoning agent that performs editing through an iterative perception-reasoning-action loop, effectively simulating multi-turn human-model interaction. Instead of issuing a single prompt or a static plan, MIRA predicts atomic edit instructions step by step, using visual feedback to guide its decisions. Our 150K-example multimodal tool-use dataset, MIRA-Editing, combined with a two-stage SFT + GRPO training pipeline, enables MIRA to reason over and execute complex editing instructions. When paired with open-source image editing models such as Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit, MIRA significantly improves both semantic consistency and perceptual quality, achieving performance comparable to or exceeding that of proprietary systems such as GPT-Image and Nano-Banana.
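The abstract describes the perception-reasoning-action loop only at a high level, so the Python sketch below is a minimal illustration of that control flow, not the paper's implementation. All names here (Step, reason, apply_edit, edit_with_agent) and the toy two-step plan are assumptions invented for illustration; in the real system, reason would be the trained MIRA policy model and apply_edit would wrap an editing backbone such as Flux.1-Kontext, Step1X-Edit, or Qwen-Image-Edit.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One iteration of the loop (hypothetical structure)."""
    instruction: str | None  # next atomic edit, or None once the goal is met
    rationale: str           # reasoning trace carried into later iterations

def reason(image: str, goal: str, history: list[Step]) -> Step:
    """Stand-in for the MIRA policy (a multimodal LM): perceive the current
    image, reason over the goal and prior steps, and emit the next atomic
    edit instruction. Here it just replays a hard-coded two-step plan."""
    plan = ["remove the red car", "brighten the sky"]
    if len(history) < len(plan):
        return Step(plan[len(history)], f"step {len(history) + 1} of the plan")
    return Step(None, "goal appears satisfied; stop editing")

def apply_edit(image: str, instruction: str) -> str:
    """Stand-in for an editing backbone (e.g. Flux.1-Kontext, Step1X-Edit,
    Qwen-Image-Edit); images are plain strings in this toy."""
    return f"{image} + [{instruction}]"

def edit_with_agent(image: str, goal: str, max_steps: int = 5):
    """Iterate perceive -> reason -> act, feeding each edited image back
    into the next round, until the agent stops or the budget runs out."""
    history: list[Step] = []
    for _ in range(max_steps):
        step = reason(image, goal, history)          # perception + reasoning
        history.append(step)
        if step.instruction is None:                 # agent decides to stop
            break
        image = apply_edit(image, step.instruction)  # action: one atomic edit
    return image, history

final, trace = edit_with_agent("photo.png", "clean up the street scene")
print(final)  # photo.png + [remove the red car] + [brighten the sky]
```

The design point the sketch preserves is that each new instruction is conditioned on the current edited image, which is what lets an agent of this kind catch semantic drift mid-trajectory instead of discovering it only after a single monolithic edit.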