Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer

Abstract

Text-guided color editing in images and videos is a fundamental yet unsolved problem. It requires fine-grained manipulation of color attributes, including albedo, light source color, and ambient lighting, while preserving physical consistency in geometry, material properties, and light-matter interactions. Existing training-free methods apply broadly across editing tasks but struggle with precise color control and often introduce visual inconsistency in both edited and non-edited regions. In this work, we present ColorCtrl, a training-free color editing method that leverages the attention mechanisms of modern Multi-Modal Diffusion Transformers (MM-DiT). By disentangling structure and color through targeted manipulation of attention maps and value tokens, our method enables accurate and consistent color editing, along with word-level control of attribute intensity. It modifies only the regions specified by the prompt and leaves unrelated areas untouched. Extensive experiments on both SD3 and FLUX.1-dev demonstrate that ColorCtrl outperforms existing training-free approaches and achieves state-of-the-art performance in both edit quality and consistency. Furthermore, our method surpasses strong commercial models such as FLUX.1 Kontext Max and GPT-4o Image Generation in terms of consistency. When extended to video models such as CogVideoX, our approach exhibits greater advantages, particularly in maintaining temporal coherence and editing stability. Finally, our method also generalizes to instruction-based editing diffusion models such as Step1X-Edit and FLUX.1 Kontext dev, further demonstrating its versatility.
