Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception

Abstract

With the success of image generation, generative diffusion models are increasingly adopted for discriminative tasks, as pixel generation provides a unified perception interface. However, directly repurposing the generative denoising process for discriminative objectives reveals critical gaps that have rarely been addressed. Generative models tolerate intermediate sampling errors as long as the final distribution remains plausible, but discriminative tasks require rigorous accuracy throughout, as evidenced in challenging multi-modal tasks such as referring image segmentation. Motivated by this gap, we analyze and enhance the alignment between generative diffusion processes and perception tasks, focusing on how perception quality evolves during denoising. We find that: (1) earlier denoising steps contribute disproportionately to perception quality, prompting us to propose tailored learning objectives that reflect the varying timestep contributions; (2) later denoising steps show unexpected perception degradation, highlighting sensitivity to training-denoising distribution shifts, which we address with diffusion-tailored data augmentation; and (3) generative processes uniquely enable interactivity, serving as controllable user interfaces that adapt to correctional prompts in multi-round interactions. Our insights significantly improve diffusion-based perception models without architectural changes, achieving state-of-the-art performance on depth estimation, referring image segmentation, and generalist perception tasks. Code is available at https://github.com/ziqipang/ADDP.
