Meta-CoT: Enhancing Granularity and Generalization in Image Editing

Abstract

Unified multi-modal understanding/generative models have improved image editing performance by incorporating fine-grained understanding into their Chain-of-Thought (CoT) process. However, a critical question remains underexplored: which forms of CoT and which training strategies can jointly enhance understanding granularity and generalization? To address this, we propose Meta-CoT, a paradigm that performs a two-level decomposition of any single-image editing operation and has two key properties. (1) Decomposability. We observe that any editing intention can be represented as a triplet: (task, target, required understanding ability). Inspired by this, Meta-CoT decomposes both the editing task and the target, generating task-specific CoT and traversing the editing operations on all targets. This decomposition enhances the model's understanding granularity of editing operations and guides it to learn each element of the triplet during training, substantially improving its editing capability. (2) Generalizability. In the second decomposition level, we further break down editing tasks into five fundamental meta-tasks. We find that training on these five meta-tasks, together with the other two elements of the triplet, is sufficient to achieve strong generalization across diverse, unseen editing tasks. To further align the model's editing behavior with its CoT reasoning, we introduce the CoT-Editing Consistency Reward, which encourages more accurate and effective use of CoT information during editing. Experiments demonstrate that our method achieves an overall 15.8% improvement across 21 editing tasks and generalizes effectively to unseen editing tasks when trained on only a small set of meta-tasks. Our code, benchmark, and model are released at https://shiyi-zh0408.github.io/projectpages/Meta-CoT/
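As a rough illustration of the triplet view described above, the following minimal sketch represents an editing intention as (task, target, required understanding ability) and expands it into one step per target. It is not the authors' implementation: the names EditTriplet, decompose, and render_cot are hypothetical, and the example task and ability labels are placeholders, since the abstract does not name the five meta-tasks or the understanding abilities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditTriplet:
    """One (task, target, ability) triplet. All field values are placeholders."""
    task: str      # editing task label (hypothetical, not one of the paper's meta-tasks)
    target: str    # object or region the edit applies to
    ability: str   # understanding ability the edit requires (hypothetical label)


def decompose(task: str, targets: list[str], ability: str) -> list[EditTriplet]:
    """Expand one editing intention into one triplet per target."""
    return [EditTriplet(task, target, ability) for target in targets]


def render_cot(triplets: list[EditTriplet]) -> str:
    """Format a step-by-step plan that visits every target in order."""
    steps = [
        f"Step {i}: apply '{t.task}' to '{t.target}' (requires {t.ability})"
        for i, t in enumerate(triplets, start=1)
    ]
    return "\n".join(steps)


# Hypothetical instruction: recolor two cups in the same image.
triplets = decompose("recolor", ["left cup", "right cup"], "spatial localization")
plan = render_cot(triplets)
print(plan)
```

The sketch only shows the data shape and the traversal over targets. In the paper, the CoT is generated by the model itself rather than by a template.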
