Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing
July 7, 2025
Authors: Chun-Hsiao Yeh, Yilin Wang, Nanxuan Zhao, Richard Zhang, Yuheng Li, Yi Ma, Krishna Kumar Singh
cs.AI
Abstract
Recent diffusion-based image editing methods have significantly advanced text-guided tasks but often struggle to interpret complex, indirect instructions. Moreover, current models frequently suffer from poor identity preservation and unintended edits, or rely heavily on manual masks. To address these challenges, we introduce X-Planner, a Multimodal Large Language Model (MLLM)-based planning system that effectively bridges user intent with editing model capabilities. X-Planner employs chain-of-thought reasoning to systematically decompose complex instructions into simpler, clearer sub-instructions. For each sub-instruction, X-Planner automatically generates precise edit types and segmentation masks, eliminating manual intervention and ensuring localized, identity-preserving edits. Additionally, we propose a novel automated pipeline for generating large-scale data to train X-Planner, which achieves state-of-the-art results on both existing benchmarks and our newly introduced complex editing benchmark.
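
To make the planning output described in the abstract concrete, the sketch below shows one plausible way to represent the decomposition of a complex instruction into sub-instructions, each paired with an edit type and a segmentation mask. This is a minimal illustration, not the paper's interface: the class names, the edit-type labels, and the plan_edits function are assumptions, and a real X-Planner would query an MLLM with chain-of-thought prompting and a segmentation model rather than return a hard-coded plan.

# Hypothetical sketch of an X-Planner-style decomposition.
# Each complex instruction becomes a list of simple sub-instructions,
# each with an edit type and a binary segmentation mask localizing the edit.
# All names and fields below are illustrative assumptions.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class SubInstruction:
    text: str          # simple, explicit sub-instruction for the editing model
    edit_type: str     # e.g. "local" or "global" (assumed taxonomy)
    mask: np.ndarray   # boolean segmentation mask restricting where the edit applies


def plan_edits(instruction: str, image: np.ndarray) -> List[SubInstruction]:
    """Placeholder planner: decompose a complex instruction into localized,
    mask-grounded sub-instructions. Returns a toy, hard-coded plan here."""
    h, w = image.shape[:2]
    return [
        SubInstruction(
            text="change the jacket color to red",
            edit_type="local",
            mask=np.zeros((h, w), dtype=bool),  # would come from a segmentation model
        ),
        SubInstruction(
            text="add falling snow to the background",
            edit_type="global",
            mask=np.ones((h, w), dtype=bool),
        ),
    ]


if __name__ == "__main__":
    dummy_image = np.zeros((512, 512, 3), dtype=np.uint8)
    for step in plan_edits("give the person a winter look", dummy_image):
        print(step.edit_type, "|", step.text, "| mask shape:", step.mask.shape)

Downstream, each (sub-instruction, mask) pair would be handed to an off-the-shelf editing model, which is how the system aims to keep edits localized and identity-preserving without user-drawn masks.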