

Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback

July 3, 2025
作者: Nina Konovalova, Maxim Nikolaev, Andrey Kuznetsov, Aibek Alanov
cs.AI

Abstract

Despite significant progress in text-to-image diffusion models, achieving precise spatial control over generated outputs remains challenging. ControlNet addresses this by introducing an auxiliary conditioning module, while ControlNet++ further refines alignment through a cycle consistency loss applied only to the final denoising steps. However, this approach neglects intermediate generation stages, limiting its effectiveness. We propose InnerControl, a training strategy that enforces spatial consistency across all diffusion steps. Our method trains lightweight convolutional probes to reconstruct input control signals (e.g., edges, depth) from intermediate UNet features at every denoising step. These probes efficiently extract signals even from highly noisy latents, enabling pseudo ground truth controls for training. By minimizing the discrepancy between predicted and target conditions throughout the entire diffusion process, our alignment loss improves both control fidelity and generation quality. Combined with established techniques like ControlNet++, InnerControl achieves state-of-the-art performance across diverse conditioning methods (e.g., edges, depth).
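The core idea above — a lightweight probe that reconstructs the control signal from intermediate features at every denoising step, with an alignment loss averaged over all steps — can be sketched as follows. This is a toy illustration, not the paper's implementation: all names and shapes are hypothetical, and the paper's lightweight convolutional probe is approximated here by a per-pixel linear map (equivalent to a 1×1 convolution) for brevity.

```python
import numpy as np

def probe_predict(features, weights, bias):
    """Toy probe: map per-pixel feature vectors to a 1-channel control map.

    A per-pixel linear map is the same computation as a 1x1 convolution,
    standing in for the paper's lightweight convolutional probes.
    features: (C, H, W) intermediate UNet features at one denoising step
    weights:  (C,) probe weights; bias: scalar
    returns:  (H, W) predicted control signal (e.g., an edge or depth map)
    """
    return np.tensordot(weights, features, axes=(0, 0)) + bias

def alignment_loss(feature_seq, target_control, weights, bias):
    """Average MSE between predicted and target control over ALL steps.

    Unlike a cycle-consistency loss applied only to the final denoising
    steps, this averages the discrepancy across every intermediate step,
    mirroring the all-step spatial consistency described in the abstract.
    feature_seq: list of (C, H, W) features, one per denoising step
    """
    losses = [np.mean((probe_predict(f, weights, bias) - target_control) ** 2)
              for f in feature_seq]
    return float(np.mean(losses))
```

In training, this scalar would be added to the usual diffusion objective so that gradients flow back through the intermediate features at every timestep, not just the final ones.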