Meta-CoT: Enhancing Granularity and Generalization in Image Editing

Abstract

Unified multimodal understanding and generation models have improved image editing performance by incorporating fine-grained understanding into their Chain-of-Thought (CoT) process. However, a critical question remains underexplored: which forms of CoT and which training strategies can jointly enhance both understanding granularity and generalization? To address this, we propose Meta-CoT, a paradigm that performs a two-level decomposition of any single-image editing operation and has two key properties. (1) Decomposability. We observe that any editing intention can be represented as a triplet: (task, target, required understanding ability). Building on this observation, Meta-CoT decomposes both the editing task and the target, generating a task-specific CoT and traversing the editing operations over all targets. This decomposition sharpens the model's understanding of editing operations and guides it to learn each element of the triplet during training, substantially improving its editing capability. (2) Generalizability. At the second level of decomposition, we further break editing tasks down into five fundamental meta-tasks. We find that training on these five meta-tasks, together with the other two elements of the triplet, is sufficient to achieve strong generalization across diverse, unseen editing tasks. To further align the model's editing behavior with its CoT reasoning, we introduce the CoT-Editing Consistency Reward, which encourages more accurate and effective use of CoT information during editing. Experiments show that our method achieves an overall improvement of 15.8% across 21 editing tasks and generalizes effectively to unseen editing tasks when trained on only a small set of meta-tasks. Our code, benchmark, and model are released at https://shiyi-zh0408.github.io/projectpages/Meta-CoT/
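As a rough illustration of the triplet representation described above, the following Python sketch models an editing intention as a (task, target, required understanding ability) record and traverses the operation over every target. It is not the released Meta-CoT implementation: the names `EditTriplet`, `decompose`, and `render_cot`, as well as the example task, targets, and ability strings, are assumptions chosen only for illustration, and the five meta-tasks are deliberately left unnamed because the abstract does not list them.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EditTriplet:
    """One editing intention (hypothetical structure, not the authors' code)."""
    task: str     # editing task, e.g. a meta-task label
    target: str   # the object or region to edit
    ability: str  # understanding ability the edit requires


def decompose(task: str, targets: List[str], ability: str) -> List[EditTriplet]:
    """Expand one instruction into one triplet per target (target traversal)."""
    return [EditTriplet(task, target, ability) for target in targets]


def render_cot(triplets: List[EditTriplet]) -> str:
    """Render the triplets as a simple step-by-step textual CoT."""
    lines = [
        f"Step {i}: apply '{t.task}' to '{t.target}' (requires {t.ability})"
        for i, t in enumerate(triplets, start=1)
    ]
    return "\n".join(lines)


# Illustrative values only; they do not come from the paper.
steps = decompose("remove", ["red car", "blue car"], "spatial localization")
print(render_cot(steps))
```

Keeping each target in its own record mirrors the idea that the model reasons about every (task, target, ability) element separately, rather than treating a multi-target instruction as one opaque request.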
