Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception

Abstract

With the success of image generation, generative diffusion models are increasingly adopted for discriminative tasks, as pixel generation provides a unified perception interface. However, directly repurposing the generative denoising process for discriminative objectives reveals critical gaps that have rarely been addressed. Generative models tolerate intermediate sampling errors as long as the final distribution remains plausible, whereas discriminative tasks require rigorous accuracy throughout the process, as evidenced in challenging multi-modal tasks like referring image segmentation. Motivated by this gap, we analyze and enhance the alignment between generative diffusion processes and perception tasks, focusing on how perception quality evolves during denoising. We find that: (1) earlier denoising steps contribute disproportionately to perception quality, prompting us to propose tailored learning objectives that reflect the varying contributions of timesteps; (2) later denoising steps show unexpected perception degradation, highlighting sensitivity to training-denoising distribution shifts, which we address with diffusion-tailored data augmentation; and (3) generative processes uniquely enable interactivity, serving as controllable user interfaces that adapt to correctional prompts in multi-round interactions. Our insights significantly improve diffusion-based perception models without architectural changes, achieving state-of-the-art performance on depth estimation, referring image segmentation, and generalist perception tasks. Code is available at https://github.com/ziqipang/ADDP.

