Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback

Abstract

Despite significant progress in text-to-image diffusion models, achieving precise spatial control over generated outputs remains challenging. ControlNet addresses this by introducing an auxiliary conditioning module, and ControlNet++ further refines alignment through a cycle-consistency loss applied only to the final denoising steps. Because that loss neglects the intermediate generation stages, its effectiveness is limited. We propose InnerControl, a training strategy that enforces spatial consistency across all diffusion steps. Our method trains lightweight convolutional probes to reconstruct the input control signal (e.g., edges, depth) from intermediate UNet features at every denoising step. These probes extract the signal efficiently even from highly noisy latents, providing pseudo-ground-truth controls for training. By minimizing the discrepancy between predicted and target conditions throughout the entire diffusion process, our alignment loss improves both control fidelity and generation quality. Combined with established techniques such as ControlNet++, InnerControl achieves state-of-the-art performance across diverse conditioning methods (e.g., edges, depth).
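
The probe-based alignment loss described in the abstract can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration: the abstract does not specify the probe architecture, the loss form, or how steps are weighted, so the sketch uses a single 1x1 convolution as the probe, a mean squared error as the discrepancy, and a uniform average over denoising steps. The names `probe_1x1` and `alignment_loss` are hypothetical, and random arrays stand in for real UNet features.

```python
import numpy as np


def probe_1x1(features, weight, bias):
    """Hypothetical probe: a 1x1 convolution from (C, H, W) features to an (H, W) control map.

    A 1x1 convolution is a per-pixel linear map over the channel axis.
    """
    return np.tensordot(weight, features, axes=([0], [0])) + bias


def alignment_loss(features_per_step, target_control, weight, bias):
    """Average, over denoising steps, of the MSE between the probe output and the target control.

    Assumption: MSE and uniform step weighting; the source only says the
    discrepancy is minimized throughout the diffusion process.
    """
    per_step = [
        np.mean((probe_1x1(f, weight, bias) - target_control) ** 2)
        for f in features_per_step
    ]
    return float(np.mean(per_step))


# Toy setup: 4 feature channels on an 8x8 grid, 5 denoising steps.
rng = np.random.default_rng(0)
channels, height, width, steps = 4, 8, 8, 5
target = rng.random((height, width))  # stand-in for a depth or edge map

# Stand-in features: channel 0 carries the control signal, corrupted by
# noise that is strongest at the earliest (noisiest) step.
noise_levels = np.linspace(1.0, 0.0, steps)
features = []
for sigma in noise_levels:
    f = rng.normal(size=(channels, height, width))
    f[0] = target + sigma * rng.normal(size=(height, width))
    features.append(f)

# A probe that reads channel 0 only.
weight = np.array([1.0, 0.0, 0.0, 0.0])
loss = alignment_loss(features, target, weight, bias=0.0)
print(f"alignment loss over {steps} steps: {loss:.4f}")
```

In an actual training setup the probes would be learned networks and this term would be added to the usual diffusion objective; the sketch only shows how a per-step reconstruction error can be accumulated across the whole denoising trajectory rather than at the final steps alone.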
