Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback
July 3, 2025
Authors: Nina Konovalova, Maxim Nikolaev, Andrey Kuznetsov, Aibek Alanov
cs.AI
Abstract
Despite significant progress in text-to-image diffusion models, achieving
precise spatial control over generated outputs remains challenging. ControlNet
addresses this by introducing an auxiliary conditioning module, while
ControlNet++ further refines alignment through a cycle consistency loss applied
only to the final denoising steps. However, this approach neglects intermediate
generation stages, limiting its effectiveness. We propose InnerControl, a
training strategy that enforces spatial consistency across all diffusion steps.
Our method trains lightweight convolutional probes to reconstruct input control
signals (e.g., edges, depth) from intermediate UNet features at every denoising
step. These probes extract signals efficiently even from highly noisy latents,
providing pseudo ground truth control signals for training. By minimizing the
discrepancy between predicted and target conditions throughout the entire
diffusion process, our alignment loss improves both control fidelity and
generation quality. Combined with established techniques like ControlNet++,
InnerControl achieves state-of-the-art performance across diverse conditioning
methods (e.g., edges, depth).
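The core idea described above can be sketched in a few lines: a lightweight probe maps the intermediate features at each denoising step to a predicted control signal, and the alignment loss averages the discrepancy to the target control over all timesteps rather than only the final ones. This is a minimal, pure-Python illustration; the probe here is a toy linear map (the paper uses convolutional probes on UNet features), and all function names are hypothetical, not the authors' actual API.

```python
def probe(features, weights):
    """Toy linear probe: predicts one control value per spatial position
    from that position's feature vector (stand-in for a conv probe)."""
    return [sum(w * f for w, f in zip(weights, feat)) for feat in features]

def alignment_loss(per_step_features, target_control, weights):
    """Mean squared discrepancy between probe predictions and the target
    control signal, averaged across EVERY diffusion timestep (the key
    difference from a loss applied only to the final denoising steps)."""
    total = 0.0
    for features in per_step_features:  # one feature map per denoising step
        pred = probe(features, weights)
        total += sum((p - t) ** 2 for p, t in zip(pred, target_control)) / len(pred)
    return total / len(per_step_features)
```

In training, this term would be added to the usual diffusion objective so that even highly noisy intermediate steps are pushed toward the input condition (e.g., an edge or depth map).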