Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer
August 12, 2025
作者: Zixin Yin, Xili Dai, Ling-Hao Chen, Deyu Zhou, Jianan Wang, Duomin Wang, Gang Yu, Lionel M. Ni, Lei Zhang, Heung-Yeung Shum
cs.AI
Abstract
Text-guided color editing in images and videos is a fundamental yet unsolved
problem, requiring fine-grained manipulation of color attributes, including
albedo, light source color, and ambient lighting, while preserving physical
consistency in geometry, material properties, and light-matter interactions.
Existing training-free methods offer broad applicability across editing tasks
but struggle with precise color control and often introduce visual
inconsistency in both edited and non-edited regions. In this work, we present
ColorCtrl, a training-free color editing method that leverages the attention
mechanisms of modern Multi-Modal Diffusion Transformers (MM-DiT). By
disentangling structure and color through targeted manipulation of attention
maps and value tokens, our method enables accurate and consistent color
editing, along with word-level control of attribute intensity. Our method
modifies only the intended regions specified by the prompt, leaving unrelated
areas untouched. Extensive experiments on both SD3 and FLUX.1-dev demonstrate
that ColorCtrl outperforms existing training-free approaches and achieves
state-of-the-art performance in both edit quality and consistency.
Furthermore, our method surpasses strong commercial models such as FLUX.1
Kontext Max and GPT-4o Image Generation in terms of consistency. When extended
to video models like CogVideoX, our approach exhibits even greater advantages,
particularly in maintaining temporal coherence and editing stability. Finally,
our method also generalizes to instruction-based editing diffusion models such
as Step1X-Edit and FLUX.1 Kontext dev, further demonstrating its versatility.
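
The abstract's core mechanism, disentangling structure from color by manipulating attention maps and value tokens inside MM-DiT blocks, can be illustrated with a minimal sketch. The exact ColorCtrl procedure is not specified in the abstract; the code below shows only one plausible reading, in which the attention map from the source-prompt denoising pass carries spatial structure while the value tokens from the edited-prompt pass carry color. All function and variable names here are hypothetical and not taken from the paper.

    # Minimal PyTorch sketch (an assumption, not the paper's implementation)
    # of swapping value tokens under a preserved attention map in one
    # MM-DiT-style attention layer.

    import torch


    def attention(q, k, v):
        """Scaled dot-product attention; returns the output and the attention map."""
        scale = q.shape[-1] ** -0.5
        attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
        return attn @ v, attn


    def structure_preserving_edit(q_src, k_src, v_src, v_edit):
        """Hypothetical structure/color disentanglement for a single layer.

        The attention map is computed from the source (original-prompt) branch,
        so geometry and layout are preserved, while the value tokens come from
        the edited-prompt branch, so color/appearance follows the new text.
        """
        _, attn_src = attention(q_src, k_src, v_src)  # structure carrier
        return attn_src @ v_edit                      # color carrier


    if __name__ == "__main__":
        # Toy shapes (batch, tokens, dim); real MM-DiT blocks are multi-head
        # and mix image and text tokens in one sequence.
        q_src, k_src, v_src = (torch.randn(1, 16, 64) for _ in range(3))
        v_edit = torch.randn(1, 16, 64)
        out = structure_preserving_edit(q_src, k_src, v_src, v_edit)
        print(out.shape)  # torch.Size([1, 16, 64])

Word-level control of attribute intensity and the restriction of changes to prompt-specified regions, both claimed in the abstract, would require additional token selection and scaling logic that this sketch omits.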