Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback

Abstract

Despite significant progress in text-to-image diffusion models, achieving precise spatial control over generated outputs remains challenging. ControlNet addresses this by introducing an auxiliary conditioning module, while ControlNet++ further refines alignment through a cycle consistency loss applied only to the final denoising steps. However, this approach neglects intermediate generation stages, which limits its effectiveness. We propose InnerControl, a training strategy that enforces spatial consistency across all diffusion steps. Our method trains lightweight convolutional probes to reconstruct the input control signals (e.g., edges, depth) from intermediate UNet features at every denoising step. These probes efficiently extract the signals even from highly noisy latents, providing pseudo ground truth controls for training. By minimizing the discrepancy between predicted and target conditions throughout the entire diffusion process, our alignment loss improves both control fidelity and generation quality. Combined with established techniques such as ControlNet++, InnerControl achieves state-of-the-art performance across diverse conditioning methods (e.g., edges, depth).
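As a rough illustration of the alignment objective described in the abstract, the loss could be written as follows. This is only a sketch under stated assumptions: the symbols $c$ (input control signal), $x_t$ (noisy latent at step $t$), $f_t$ (intermediate UNet features), $P_\phi$ (convolutional probe), $d$ (a per-pixel distance) and $\lambda$ (a loss weight) are notation introduced here for illustration, and adding the alignment term to a standard denoising loss is likewise an assumption, not something the abstract states.

```latex
% Hypothetical notation; none of these symbols are taken from the abstract.
% f_t : intermediate UNet features computed from the noisy latent x_t, given control c
% P_\phi : lightweight convolutional probe that predicts the control signal from f_t
\begin{align}
  \mathcal{L}_{\text{align}}
    &= \mathbb{E}_{x_0,\,\epsilon,\,t}
       \big[\, d\big(P_\phi(f_t),\; c\big) \big], \\
  \mathcal{L}_{\text{total}}
    &= \mathcal{L}_{\text{diff}} + \lambda\, \mathcal{L}_{\text{align}} .
\end{align}
```

Under this reading, the expectation runs over all timesteps $t$ rather than only the final ones, which is the distinction the abstract draws with ControlNet++.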
