Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer

Abstract

Text-guided color editing in images and videos is a fundamental yet unsolved problem. It requires fine-grained manipulation of color attributes, including albedo, light source color, and ambient lighting, while preserving physical consistency in geometry, material properties, and light-matter interactions. Existing training-free methods apply broadly across editing tasks but struggle with precise color control and often introduce visual inconsistencies in both edited and non-edited regions. In this work, we present ColorCtrl, a training-free color editing method that leverages the attention mechanisms of modern Multi-Modal Diffusion Transformers (MM-DiT). By disentangling structure and color through targeted manipulation of attention maps and value tokens, our method enables accurate and consistent color editing, along with word-level control of attribute intensity. It modifies only the regions specified by the prompt and leaves unrelated areas untouched. Extensive experiments on both SD3 and FLUX.1-dev demonstrate that ColorCtrl outperforms existing training-free approaches and achieves state-of-the-art performance in both edit quality and consistency. Furthermore, our method surpasses strong commercial models such as FLUX.1 Kontext Max and GPT-4o Image Generation in terms of consistency. When extended to video models such as CogVideoX, our approach shows even greater advantages, particularly in maintaining temporal coherence and editing stability. Finally, our method also generalizes to instruction-based editing diffusion models such as Step1X-Edit and FLUX.1 Kontext dev, further demonstrating its versatility.
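To make the phrase "manipulation of attention maps and value tokens" more concrete, the following is a minimal single-head sketch in plain Python. It is not ColorCtrl's actual procedure: the abstract does not say which quantities are transferred, in which layers or timesteps, or how regions are masked. The mixing rule shown here (attention map computed from a source pass, value tokens taken from an edit pass) and all function names are assumptions chosen only to illustrate how structure and appearance can be handled separately inside an attention layer.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Row-wise softmax of scaled dot products between queries and keys."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]


def apply_attention(attn, values):
    """Weighted sum of value tokens under a given attention map."""
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in attn
    ]


def edit_step(src_queries, src_keys, tgt_values):
    """Hypothetical mixing rule (an assumption, not taken from the paper):
    the attention map comes from the source pass, which carries layout,
    while the value tokens come from the edit pass, which carries appearance.
    """
    attn = attention_map(src_queries, src_keys)
    return attn, apply_attention(attn, tgt_values)
```

In this sketch the spatial weighting is fixed by the source pass, so only the content being aggregated changes. A real MM-DiT layer attends jointly over text and image tokens and uses many heads, which the sketch omits.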
