Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer
August 12, 2025
作者: Zixin Yin, Xili Dai, Ling-Hao Chen, Deyu Zhou, Jianan Wang, Duomin Wang, Gang Yu, Lionel M. Ni, Lei Zhang, Heung-Yeung Shum
cs.AI
Abstract
Text-guided color editing in images and videos is a fundamental yet unsolved
problem, requiring fine-grained manipulation of color attributes, including
albedo, light source color, and ambient lighting, while preserving physical
consistency in geometry, material properties, and light-matter interactions.
Existing training-free methods offer broad applicability across editing tasks
but struggle with precise color control and often introduce visual
inconsistency in both edited and non-edited regions. In this work, we present
ColorCtrl, a training-free color editing method that leverages the attention
mechanisms of modern Multi-Modal Diffusion Transformers (MM-DiT). By
disentangling structure and color through targeted manipulation of attention
maps and value tokens, our method enables accurate and consistent color
editing, along with word-level control of attribute intensity. Our method
modifies only the intended regions specified by the prompt, leaving unrelated
areas untouched. Extensive experiments on both SD3 and FLUX.1-dev demonstrate
that ColorCtrl outperforms existing training-free approaches and achieves
state-of-the-art performance in both edit quality and consistency.
Furthermore, our method surpasses strong commercial models such as FLUX.1
Kontext Max and GPT-4o Image Generation in terms of consistency. When extended
to video models like CogVideoX, our approach exhibits greater advantages,
particularly in maintaining temporal coherence and editing stability. Finally,
our method also generalizes to instruction-based editing diffusion models such
as Step1X-Edit and FLUX.1 Kontext dev, further demonstrating its versatility.
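
The abstract describes disentangling structure from color by manipulating attention maps and value tokens inside MM-DiT attention. The sketch below is only an illustration of that general idea, not the authors' implementation: it assumes a two-pass setup (a source/reconstruction pass and an edit pass) and toy tensor shapes, and all names (`joint_attention`, `attn_override`, `src_attn`) are hypothetical.

```python
# Illustrative sketch: a toy joint-attention call showing how an attention map
# from a source pass could be reused in an edit pass, so spatial structure
# (where tokens attend) is preserved while value tokens from the edited prompt
# carry the new color. Shapes, names, and the two-pass setup are assumptions.
import torch

def joint_attention(q, k, v, attn_override=None, return_attn=False):
    """Scaled dot-product attention over concatenated text+image tokens.

    q, k, v: (batch, heads, tokens, dim)
    attn_override: optional attention map to reuse (e.g. from the source pass).
    """
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    if attn_override is not None:
        # Structure injection: reuse the source pass's attention map so the
        # edited output attends to the same spatial layout.
        attn = attn_override
    out = attn @ v  # value tokens still come from the edit pass, so color can change
    return (out, attn) if return_attn else out

# Toy usage with random tokens standing in for MM-DiT text+image tokens.
b, h, n, d = 1, 8, 64, 32
q_src, k_src, v_src = (torch.randn(b, h, n, d) for _ in range(3))
q_edit, k_edit, v_edit = (torch.randn(b, h, n, d) for _ in range(3))

_, src_attn = joint_attention(q_src, k_src, v_src, return_attn=True)      # source pass
edited = joint_attention(q_edit, k_edit, v_edit, attn_override=src_attn)  # edit pass
print(edited.shape)  # torch.Size([1, 8, 64, 32])
```

In this reading, the attention map acts as the structure carrier and the value tokens as the appearance carrier; how ColorCtrl actually selects which layers, timesteps, or token subsets to manipulate is specified in the paper itself, not in this sketch.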