Meta-CoT: Enhancing Granularity and Generalization in Image Editing
April 27, 2026
Authors: Shiyi Zhang, Yiji Cheng, Tiankai Hang, Zijin Yin, Runze He, Yu Xu, Wenxun Dai, Yunlong Lin, Chunyu Wang, Qinglin Lu, Yansong Tang
cs.AI
Abstract
Unified multi-modal understanding/generation models have shown improved image editing performance by incorporating fine-grained understanding into their Chain-of-Thought (CoT) process. However, a critical question remains underexplored: what forms of CoT and which training strategies can jointly enhance both understanding granularity and generalization? To address this, we propose Meta-CoT, a paradigm that performs a two-level decomposition of any single-image editing operation and has two key properties: (1) Decomposability. We observe that any editing intention can be represented as a triplet: (task, target, required understanding ability). Inspired by this, Meta-CoT decomposes both the editing task and the target, generating task-specific CoT and traversing editing operations over all targets. This decomposition enhances the model's fine-grained understanding of editing operations and guides it to learn each element of the triplet during training, substantially improving editing capability. (2) Generalizability. At the second decomposition level, we further break editing tasks down into five fundamental meta-tasks. We find that training on these five meta-tasks, together with the other two elements of the triplet, is sufficient to achieve strong generalization across diverse, unseen editing tasks. To further align the model's editing behavior with its CoT reasoning, we introduce the CoT-Editing Consistency Reward, which encourages more accurate and effective use of CoT information during editing. Experiments demonstrate that our method achieves an overall 15.8% improvement across 21 editing tasks, and generalizes effectively to unseen editing tasks when trained on only a small set of meta-tasks. Our code, benchmark, and model are released at https://shiyi-zh0408.github.io/projectpages/Meta-CoT/
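To make the two-level decomposition concrete, below is a minimal, hypothetical Python sketch of the abstract's core ideas: representing an editing intention as (task, target, required understanding ability) triplets that traverse all targets, and scoring CoT-editing consistency as the fraction of planned triplets that were actually executed. All names here (EditTriplet, META_TASKS, decompose_edit_intent, cot_editing_consistency_reward) and the reward formula are illustrative assumptions, not the paper's implementation, and the five meta-task labels are placeholders since the abstract does not enumerate them.

```python
# Illustrative sketch of the Meta-CoT decomposition and consistency reward.
# All identifiers and the reward definition are hypothetical placeholders.
from dataclasses import dataclass
from typing import List

# Placeholder labels for the five fundamental meta-tasks mentioned in the abstract.
META_TASKS = ["meta_task_1", "meta_task_2", "meta_task_3", "meta_task_4", "meta_task_5"]


@dataclass
class EditTriplet:
    """One unit of the (task, target, required understanding ability) triplet."""
    task: str     # meta-task the edit reduces to
    target: str   # object or region the edit applies to
    ability: str  # understanding ability the edit requires


def decompose_edit_intent(intent: str, targets: List[str]) -> List[EditTriplet]:
    """First-level decomposition (stub): split an editing intention into triplets,
    traversing every target. A real system would classify the intent into a meta-task."""
    return [
        EditTriplet(task=META_TASKS[0], target=t, ability="fine_grained_understanding")
        for t in targets
    ]


def cot_editing_consistency_reward(cot_plan: List[EditTriplet],
                                   executed: List[EditTriplet]) -> float:
    """Toy consistency reward: fraction of planned triplets realized by the editing step."""
    if not cot_plan:
        return 0.0
    realized = sum(1 for step in cot_plan if step in executed)
    return realized / len(cot_plan)


if __name__ == "__main__":
    plan = decompose_edit_intent("darken the sky and remove the car", ["sky", "car"])
    done = plan[:1]  # pretend only the first planned edit was carried out
    print(cot_editing_consistency_reward(plan, done))  # -> 0.5
```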