

MIRA: Multimodal Iterative Reasoning Agent for Image Editing

November 26, 2025
Authors: Ziyun Zeng, Hang Hua, Jiebo Luo
cs.AI

Abstract

Instruction-guided image editing offers an intuitive way for users to edit images with natural language. However, diffusion-based editing models often struggle to accurately interpret complex user instructions, especially those involving compositional relationships, contextual cues, or referring expressions, leading to edits that drift semantically or fail to reflect the intended changes. We tackle this problem by proposing MIRA (Multimodal Iterative Reasoning Agent), a lightweight, plug-and-play multimodal reasoning agent that performs editing through an iterative perception-reasoning-action loop, effectively simulating multi-turn human-model interaction processes. Instead of issuing a single prompt or static plan, MIRA predicts atomic edit instructions step by step, using visual feedback to make its decisions. Our 150K multimodal tool-use dataset, MIRA-Editing, combined with a two-stage SFT + GRPO training pipeline, enables MIRA to perform reasoning and editing over complex editing instructions. When paired with open-source image editing models such as Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit, MIRA significantly improves both semantic consistency and perceptual quality, achieving performance comparable to or exceeding proprietary systems such as GPT-Image and Nano-Banana.