Diffusion Classifiers Understand Compositionality, but Conditions Apply

Abstract

Understanding visual scenes is fundamental to human intelligence. While discriminative models have significantly advanced computer vision, they often struggle with compositional understanding. In contrast, recent generative text-to-image diffusion models excel at synthesizing complex scenes, suggesting inherent compositional capabilities. Building on this, zero-shot diffusion classifiers have been proposed to repurpose diffusion models for discriminative tasks. While prior work offered promising results in discriminative compositional scenarios, these results remain preliminary because they rest on a small number of benchmarks and a relatively shallow analysis of the conditions under which the models succeed. To address this, we present a comprehensive study of the discriminative capabilities of diffusion classifiers on a wide range of compositional tasks. Specifically, our study covers three diffusion models (SD 1.5, 2.0, and, for the first time, 3-m) and spans 10 datasets and over 30 tasks. Further, we shed light on the role that target dataset domains play in the respective performance; to isolate the domain effects, we introduce a new diagnostic benchmark, Self-Bench, composed of images created by the diffusion models themselves. Finally, we explore the importance of timestep weighting and uncover a relationship between domain gap and timestep sensitivity, particularly for SD3-m. To sum up, diffusion classifiers understand compositionality, but conditions apply! Code and dataset are available at https://github.com/eugene6923/Diffusion-Classifiers-Compositionality.
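
To make the idea of a zero-shot diffusion classifier with timestep weighting concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common noise-prediction formulation: noise the image at several timesteps, ask the model to predict that noise under each candidate prompt, and pick the prompt with the lowest weighted error. The names `classify`, `predict_noise`, `alphas_cumprod`, and `weights` are hypothetical, and the denoiser here is a toy stand-in rather than Stable Diffusion. Models trained with a different objective (SD3-m, for example, is trained with a flow-matching objective) would need a different regression target.

```python
import numpy as np

def classify(x0, prompts, predict_noise, alphas_cumprod, timesteps,
             weights=None, n_trials=4, seed=0):
    """Return (index of best prompt, per-prompt scores).

    Score = weighted sum over timesteps of the mean squared error between
    the sampled noise and the noise predicted under each prompt.
    """
    rng = np.random.default_rng(seed)
    if weights is None:
        weights = np.ones(len(timesteps))  # uniform timestep weighting
    scores = np.zeros(len(prompts))
    for w, t in zip(weights, timesteps):
        a = alphas_cumprod[t]
        for _ in range(n_trials):
            eps = rng.standard_normal(x0.shape)
            # Forward diffusion: the same noisy sample is shared by all prompts.
            x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
            for i, prompt in enumerate(prompts):
                err = np.mean((predict_noise(x_t, t, prompt) - eps) ** 2)
                scores[i] += w * err
    return int(np.argmin(scores)), scores

# Toy stand-in for a conditional denoiser (an assumption for illustration):
# each prompt "believes" the clean image equals its prototype vector.
prototypes = {"a red cube": np.zeros(8), "a blue cube": np.ones(8)}
acp = np.linspace(0.99, 0.01, 10)  # hypothetical noise schedule

def toy_predict_noise(x_t, t, prompt):
    a = acp[t]
    return (x_t - np.sqrt(a) * prototypes[prompt]) / np.sqrt(1.0 - a)

prompts = list(prototypes)
pred, scores = classify(np.ones(8), prompts, toy_predict_noise, acp,
                        timesteps=[2, 5, 8], weights=[1.0, 1.0, 0.5])
print(prompts[pred], scores)
```

Passing a non-uniform `weights` vector is one simple way to emphasise or suppress particular noise levels, which is the kind of knob the timestep-weighting analysis is concerned with.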
