Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing
July 7, 2025
Authors: Chun-Hsiao Yeh, Yilin Wang, Nanxuan Zhao, Richard Zhang, Yuheng Li, Yi Ma, Krishna Kumar Singh
cs.AI
Abstract
Recent diffusion-based image editing methods have significantly advanced text-guided tasks but often struggle to interpret complex, indirect instructions. Moreover, current models frequently suffer from poor identity preservation and unintended edits, or rely heavily on manual masks. To address these challenges, we introduce X-Planner, a Multimodal Large Language Model (MLLM)-based planning system that effectively bridges user intent with editing model capabilities. X-Planner employs chain-of-thought reasoning to systematically decompose complex instructions into simpler, clearer sub-instructions. For each sub-instruction, X-Planner automatically generates precise edit types and segmentation masks, eliminating manual intervention and ensuring localized, identity-preserving edits. Additionally, we propose a novel automated pipeline for generating large-scale data to train X-Planner, which achieves state-of-the-art results on both existing benchmarks and our newly introduced complex editing benchmark.
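
To make the planning output described in the abstract concrete, the sketch below shows one plausible way to represent the decomposition of a complex instruction into sub-instructions, each paired with an edit type and a segmentation mask. This is a minimal illustration, not the paper's interface: the class names, the edit-type labels, and the plan_edits function are assumptions, and a real X-Planner would query an MLLM with chain-of-thought prompting and a segmentation model rather than return a hard-coded plan.

# Hypothetical sketch of an X-Planner-style decomposition.
# Each complex instruction becomes a list of simple sub-instructions,
# each with an edit type and a binary segmentation mask localizing the edit.
# All names and fields below are illustrative assumptions.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class SubInstruction:
    text: str          # simple, explicit sub-instruction for the editing model
    edit_type: str     # e.g. "local" or "global" (assumed taxonomy)
    mask: np.ndarray   # boolean segmentation mask restricting where the edit applies


def plan_edits(instruction: str, image: np.ndarray) -> List[SubInstruction]:
    """Placeholder planner: decompose a complex instruction into localized,
    mask-grounded sub-instructions. Returns a toy, hard-coded plan here."""
    h, w = image.shape[:2]
    return [
        SubInstruction(
            text="change the jacket color to red",
            edit_type="local",
            mask=np.zeros((h, w), dtype=bool),  # would come from a segmentation model
        ),
        SubInstruction(
            text="add falling snow to the background",
            edit_type="global",
            mask=np.ones((h, w), dtype=bool),
        ),
    ]


if __name__ == "__main__":
    dummy_image = np.zeros((512, 512, 3), dtype=np.uint8)
    for step in plan_edits("give the person a winter look", dummy_image):
        print(step.edit_type, "|", step.text, "| mask shape:", step.mask.shape)

Downstream, each (sub-instruction, mask) pair would be handed to an off-the-shelf editing model, which is how the system aims to keep edits localized and identity-preserving without user-drawn masks.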