
Meta-CoT: Enhancing Granularity and Generalization in Image Editing

April 27, 2026
Authors: Shiyi Zhang, Yiji Cheng, Tiankai Hang, Zijin Yin, Runze He, Yu Xu, Wenxun Dai, Yunlong Lin, Chunyu Wang, Qinglin Lu, Yansong Tang
cs.AI

Abstract

Unified multi-modal understanding/generation models have shown improved image editing performance by incorporating fine-grained understanding into their Chain-of-Thought (CoT) process. However, a critical question remains underexplored: which forms of CoT and which training strategies can jointly enhance both understanding granularity and generalization? To address this, we propose Meta-CoT, a paradigm that performs a two-level decomposition of any single-image editing operation and has two key properties: (1) Decomposability. We observe that any editing intention can be represented as a triplet: (task, target, required understanding ability). Inspired by this, Meta-CoT decomposes both the editing task and the target, generating task-specific CoT and traversing the editing operations over all targets. This decomposition sharpens the model's understanding granularity of editing operations and guides it to learn each element of the triplet during training, substantially improving editing capability. (2) Generalizability. At the second decomposition level, we further break editing tasks down into five fundamental meta-tasks. We find that training on these five meta-tasks, together with the other two elements of the triplet, is sufficient to achieve strong generalization across diverse, unseen editing tasks. To further align the model's editing behavior with its CoT reasoning, we introduce the CoT-Editing Consistency Reward, which encourages more accurate and effective use of CoT information during editing. Experiments demonstrate that our method achieves an overall 15.8% improvement across 21 editing tasks and generalizes effectively to unseen editing tasks when trained on only a small set of meta-tasks. Our code, benchmark, and model are released at https://shiyi-zh0408.github.io/projectpages/Meta-CoT/.
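To make the two-level decomposition concrete, below is a minimal Python sketch of how an editing intention might be mapped to (task, target, required understanding ability) triplets, expanded into per-target CoT steps, and scored for CoT-editing consistency. The meta-task names, the keyword rule in `decompose_intention`, and the reward stub are illustrative assumptions for exposition, not the authors' released implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class MetaTask(Enum):
    """Five fundamental meta-tasks (illustrative names; the paper defines its own set)."""
    ADD = "add"
    REMOVE = "remove"
    REPLACE = "replace"
    ATTRIBUTE_CHANGE = "attribute_change"
    GLOBAL_STYLE = "global_style"


@dataclass
class EditTriplet:
    """One unit of the decomposition: (task, target, required understanding ability)."""
    task: MetaTask
    target: str    # object or region the edit applies to
    ability: str   # e.g. "attribute recognition", "object grounding"


def decompose_intention(intention: str, targets: List[str]) -> List[EditTriplet]:
    """First level (sketch): map a free-form editing intention onto one triplet per
    affected target. A real system would let the unified model infer the task and
    the required ability; here a toy keyword rule stands in for that reasoning."""
    task = MetaTask.ATTRIBUTE_CHANGE if "color" in intention else MetaTask.REPLACE
    ability = "attribute recognition" if task is MetaTask.ATTRIBUTE_CHANGE else "object grounding"
    return [EditTriplet(task, t, ability) for t in targets]


def build_cot(triplets: List[EditTriplet]) -> List[str]:
    """Second level (sketch): emit one task-specific CoT step per triplet,
    traversing every target so each edit is reasoned about explicitly."""
    return [f"[{t.task.value}] locate '{t.target}' via {t.ability}, then apply the edit"
            for t in triplets]


def cot_editing_consistency_reward(cot_steps: List[str], executed_edits: List[str]) -> float:
    """Toy stand-in for the CoT-Editing Consistency Reward: the fraction of CoT steps
    whose target also appears in the log of edits the model actually performed."""
    if not cot_steps:
        return 0.0
    hits = sum(any(step.split("'")[1] in edit for edit in executed_edits) for step in cot_steps)
    return hits / len(cot_steps)


if __name__ == "__main__":
    triplets = decompose_intention("change the color of the car and the bus", ["car", "bus"])
    cot = build_cot(triplets)
    print("\n".join(cot))
    # Suppose the editor only modified the car; the reward penalizes the missed bus.
    print(cot_editing_consistency_reward(cot, ["recolored car"]))  # -> 0.5
```

In the paper, the consistency signal is presumably computed against the actual edit outcome rather than a string match over an edit log; the toy function above only illustrates the idea of rewarding edits that follow the CoT.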