Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer
August 12, 2025
作者: Zixin Yin, Xili Dai, Ling-Hao Chen, Deyu Zhou, Jianan Wang, Duomin Wang, Gang Yu, Lionel M. Ni, Lei Zhang, Heung-Yeung Shum
cs.AI
Abstract
Text-guided color editing in images and videos is a fundamental yet unsolved
problem, requiring fine-grained manipulation of color attributes, including
albedo, light source color, and ambient lighting, while preserving physical
consistency in geometry, material properties, and light-matter interactions.
Existing training-free methods offer broad applicability across editing tasks
but struggle with precise color control and often introduce visual
inconsistency in both edited and non-edited regions. In this work, we present
ColorCtrl, a training-free color editing method that leverages the attention
mechanisms of modern Multi-Modal Diffusion Transformers (MM-DiT). By
disentangling structure and color through targeted manipulation of attention
maps and value tokens, our method enables accurate and consistent color
editing, along with word-level control of attribute intensity. Our method
modifies only the intended regions specified by the prompt, leaving unrelated
areas untouched. Extensive experiments on both SD3 and FLUX.1-dev demonstrate
that ColorCtrl outperforms existing training-free approaches and achieves
state-of-the-art performance in both edit quality and consistency.
Furthermore, our method surpasses strong commercial models such as FLUX.1
Kontext Max and GPT-4o Image Generation in terms of consistency. When extended
to video models like CogVideoX, our approach exhibits greater advantages,
particularly in maintaining temporal coherence and editing stability. Finally,
our method also generalizes to instruction-based editing diffusion models such
as Step1X-Edit and FLUX.1 Kontext dev, further demonstrating its versatility.
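
The abstract describes disentangling structure from color by manipulating attention maps and value tokens inside MM-DiT attention. The sketch below is only an illustration of that general idea, not the authors' implementation: it assumes a two-pass setup (a source/reconstruction pass and an edit pass) and toy tensor shapes, and all names (`joint_attention`, `attn_override`, `src_attn`) are hypothetical.

```python
# Illustrative sketch: a toy joint-attention call showing how an attention map
# from a source pass could be reused in an edit pass, so spatial structure
# (where tokens attend) is preserved while value tokens from the edited prompt
# carry the new color. Shapes, names, and the two-pass setup are assumptions.
import torch

def joint_attention(q, k, v, attn_override=None, return_attn=False):
    """Scaled dot-product attention over concatenated text+image tokens.

    q, k, v: (batch, heads, tokens, dim)
    attn_override: optional attention map to reuse (e.g. from the source pass).
    """
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    if attn_override is not None:
        # Structure injection: reuse the source pass's attention map so the
        # edited output attends to the same spatial layout.
        attn = attn_override
    out = attn @ v  # value tokens still come from the edit pass, so color can change
    return (out, attn) if return_attn else out

# Toy usage with random tokens standing in for MM-DiT text+image tokens.
b, h, n, d = 1, 8, 64, 32
q_src, k_src, v_src = (torch.randn(b, h, n, d) for _ in range(3))
q_edit, k_edit, v_edit = (torch.randn(b, h, n, d) for _ in range(3))

_, src_attn = joint_attention(q_src, k_src, v_src, return_attn=True)      # source pass
edited = joint_attention(q_edit, k_edit, v_edit, attn_override=src_attn)  # edit pass
print(edited.shape)  # torch.Size([1, 8, 64, 32])
```

In this reading, the attention map acts as the structure carrier and the value tokens as the appearance carrier; how ColorCtrl actually selects which layers, timesteps, or token subsets to manipulate is specified in the paper itself, not in this sketch.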