
Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing

July 7, 2025
作者: Chun-Hsiao Yeh, Yilin Wang, Nanxuan Zhao, Richard Zhang, Yuheng Li, Yi Ma, Krishna Kumar Singh
cs.AI

Abstract

Recent diffusion-based image editing methods have significantly advanced text-guided tasks but often struggle to interpret complex, indirect instructions. Moreover, current models frequently suffer from poor identity preservation, unintended edits, or rely heavily on manual masks. To address these challenges, we introduce X-Planner, a Multimodal Large Language Model (MLLM)-based planning system that effectively bridges user intent with editing model capabilities. X-Planner employs chain-of-thought reasoning to systematically decompose complex instructions into simpler, clearer sub-instructions. For each sub-instruction, X-Planner automatically generates precise edit types and segmentation masks, eliminating manual intervention and ensuring localized, identity-preserving edits. Additionally, we propose a novel automated pipeline for generating large-scale data to train X-Planner, which achieves state-of-the-art results on both existing benchmarks and our newly introduced complex editing benchmark.
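The decomposition the abstract describes — a complex instruction broken into sub-instructions, each paired with an edit type and a mask target — can be sketched as a data flow. This is a minimal, hypothetical sketch only: the class names, edit-type labels, and the hard-coded planner stand-in below are illustrative assumptions, not the authors' actual API or MLLM planner.

```python
from dataclasses import dataclass

# Hypothetical output structure for a planner like X-Planner: one record
# per sub-instruction, with an edit type and a phrase to ground into a
# segmentation mask. Field names are assumptions for illustration.
@dataclass
class PlannedEdit:
    sub_instruction: str   # a simple, explicit edit command
    edit_type: str         # e.g. "add", "remove", "local", "global"
    mask_target: str       # region phrase to convert into a mask

def plan(instruction: str) -> list[PlannedEdit]:
    """Toy stand-in for the MLLM planner. The real system would use
    chain-of-thought reasoning; here one example is hard-coded to show
    the intended decomposition, not to reproduce the paper's method."""
    if "winter" in instruction:
        return [
            PlannedEdit("add snow on the ground", "add", "ground"),
            PlannedEdit("make the trees bare", "local", "trees"),
        ]
    # Fallback: treat the instruction as a single global edit.
    return [PlannedEdit(instruction, "global", "entire image")]

for step in plan("make the scene look like winter"):
    print(f"{step.edit_type}: {step.sub_instruction} (mask: {step.mask_target})")
```

In a full pipeline, each `PlannedEdit` would be handed to a segmentation model to produce the mask and then to the downstream editing model, so each sub-edit stays localized to its region.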