Diffusion Classifiers Understand Compositionality, but Conditions Apply

Abstract

Understanding visual scenes is fundamental to human intelligence. While discriminative models have significantly advanced computer vision, they often struggle with compositional understanding. In contrast, recent generative text-to-image diffusion models excel at synthesizing complex scenes, suggesting inherent compositional capabilities. Building on this, zero-shot diffusion classifiers have been proposed to repurpose diffusion models for discriminative tasks. Although prior work reported promising results in discriminative compositional scenarios, those results remain preliminary because they rest on a small number of benchmarks and a relatively shallow analysis of the conditions under which the models succeed. To address this, we present a comprehensive study of the discriminative capabilities of diffusion classifiers on a wide range of compositional tasks. Specifically, our study covers three diffusion models (Stable Diffusion (SD) 1.5, 2.0, and, for the first time, 3-m) across 10 datasets and more than 30 tasks. We also examine how the domain of the target dataset affects each model's performance; to isolate domain effects, we introduce a new diagnostic benchmark, Self-Bench, composed of images generated by the diffusion models themselves. Finally, we explore the importance of timestep weighting and uncover a relationship between domain gap and timestep sensitivity, particularly for SD3-m. To sum up, diffusion classifiers understand compositionality, but conditions apply! Code and dataset are available at https://github.com/eugene6923/Diffusion-Classifiers-Compositionality.
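
To illustrate why timestep weighting can matter, the following is a minimal Python sketch, not the authors' implementation. It assumes the common formulation of a zero-shot diffusion classifier, in which each candidate prompt is scored by the model's noise-prediction error when conditioned on that prompt, and the prompt with the lowest weighted error is selected. The function names, the weights, and the error values are all hypothetical; in practice the errors would be computed by a diffusion model such as SD 1.5.

```python
from typing import List, Optional, Sequence


def weighted_scores(
    errors: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Return one weighted-average error per candidate prompt.

    errors[i][t] is the (assumed, precomputed) noise-prediction error
    for prompt i at sampled timestep t. Lower means a better fit.
    """
    num_timesteps = len(errors[0])
    if weights is None:
        # Uniform weighting: every timestep counts equally.
        weights = [1.0] * num_timesteps
    if len(weights) != num_timesteps:
        raise ValueError("weights must have one entry per timestep")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [
        sum(w * e for w, e in zip(weights, row)) / total
        for row in errors
    ]


def classify(
    errors: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> int:
    """Pick the index of the prompt with the lowest weighted error."""
    scores = weighted_scores(errors, weights)
    return min(range(len(scores)), key=scores.__getitem__)


# Made-up errors for two candidate prompts at three timesteps.
toy_errors = [
    [0.90, 0.50, 0.20],  # prompt 0
    [0.60, 0.55, 0.40],  # prompt 1
]

uniform_choice = classify(toy_errors)                           # all timesteps equal
last_only_choice = classify(toy_errors, weights=[0.0, 0.0, 1.0])  # only the third timestep
```

With these toy numbers the two weightings select different prompts, which is the kind of timestep sensitivity the abstract refers to. The sketch says nothing about which weighting is preferable for a given model or dataset.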