Meta-CoT: Enhancing Granularity and Generalization in Image Editing

Abstract

Unified multimodal understanding and generation models have improved image editing performance by incorporating fine-grained understanding into their Chain-of-Thought (CoT) process. However, a critical question remains underexplored: which forms of CoT and which training strategies can jointly enhance both understanding granularity and generalization? To address this, we propose Meta-CoT, a paradigm that performs a two-level decomposition of any single-image editing operation and has two key properties. (1) Decomposability. We observe that any editing intention can be represented as a triplet: (task, target, required understanding ability). Inspired by this, Meta-CoT decomposes both the editing task and the target, generating task-specific CoT and traversing the editing operations over all targets. This decomposition enhances the model's understanding granularity of editing operations and guides it to learn each element of the triplet during training, substantially improving its editing capability. (2) Generalizability. At the second decomposition level, we further break editing tasks down into five fundamental meta-tasks. We find that training on these five meta-tasks, together with the other two elements of the triplet, is sufficient to achieve strong generalization across diverse, unseen editing tasks. To further align the model's editing behavior with its CoT reasoning, we introduce the CoT-Editing Consistency Reward, which encourages more accurate and effective use of CoT information during editing. Experiments demonstrate that our method achieves an overall 15.8% improvement across 21 editing tasks and generalizes effectively to unseen editing tasks when trained on only a small set of meta-tasks. Our code, benchmark, and model are released at https://shiyi-zh0408.github.io/projectpages/Meta-CoT/
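As a rough illustration of the (task, target, required understanding ability) triplet, the sketch below expands one editing instruction into one triplet per target, mirroring the idea of traversing the editing operations over all targets. It is not the authors' implementation: the names `EditTriplet` and `decompose`, and the example task and ability labels, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EditTriplet:
    """One editing intention (hypothetical container, not from the paper's code)."""
    task: str     # editing task label; illustrative only
    target: str   # object or region the edit applies to
    ability: str  # understanding ability the edit is assumed to require


def decompose(task: str, target_abilities: Dict[str, str]) -> List[EditTriplet]:
    """Expand one instruction into one triplet per target, in insertion order."""
    return [
        EditTriplet(task=task, target=target, ability=ability)
        for target, ability in target_abilities.items()
    ]


# Assumed example: one instruction that touches two targets.
triplets = decompose(
    "recolor",
    {
        "left car": "spatial localization",
        "umbrella": "attribute recognition",
    },
)

for triplet in triplets:
    print(triplet)
```

Keeping one record per target makes each element of the triplet explicit, which is the property the decomposition relies on. How the model actually produces task-specific CoT from such triplets is not specified here.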
