Diffusion Classifiers Understand Compositionality, but Conditions Apply

Abstract

Understanding visual scenes is fundamental to human intelligence. While discriminative models have significantly advanced computer vision, they often struggle with compositional understanding. In contrast, recent generative text-to-image diffusion models excel at synthesizing complex scenes, suggesting inherent compositional capabilities. Building on this, zero-shot diffusion classifiers have been proposed to repurpose diffusion models for discriminative tasks. While prior work offered promising results in discriminative compositional scenarios, these results remain preliminary because they rest on a small number of benchmarks and a relatively shallow analysis of the conditions under which the models succeed. To address this, we present a comprehensive study of the discriminative capabilities of diffusion classifiers on a wide range of compositional tasks. Specifically, our study covers three diffusion models (SD 1.5, 2.0, and, for the first time, 3-m), spanning 10 datasets and over 30 tasks. Further, we shed light on the role that target dataset domains play in respective performance; to isolate the domain effects, we introduce a new diagnostic benchmark, Self-Bench, composed of images created by the diffusion models themselves. Finally, we explore the importance of timestep weighting and uncover a relationship between domain gap and timestep sensitivity, particularly for SD3-m. To sum up, diffusion classifiers understand compositionality, but conditions apply! Code and dataset are available at https://github.com/eugene6923/Diffusion-Classifiers-Compositionality.
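The abstract does not spell out the decision rule, so the following is only a minimal sketch under a common formulation of zero-shot diffusion classification: each candidate prompt is scored by its noise-prediction error averaged over timesteps, and the prompt with the lowest score wins. The error values, the function names (`weighted_scores`, `classify`), and the weight vectors are all hypothetical and are not taken from the paper or its repository. In practice the errors would come from running a diffusion model on noised copies of the image.

```python
import numpy as np


def weighted_scores(errors, weights=None):
    """Collapse per-timestep errors into one score per prompt.

    errors:  array of shape (num_prompts, num_timesteps); entry [i, j] is the
             (assumed precomputed) noise-prediction error for prompt i at
             timestep j.
    weights: optional non-negative weights over timesteps; uniform if omitted.
    """
    errors = np.asarray(errors, dtype=float)
    if weights is None:
        weights = np.ones(errors.shape[1])
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so scores stay comparable
    return errors @ weights


def classify(errors, weights=None):
    """Return the index of the prompt with the lowest weighted error."""
    return int(np.argmin(weighted_scores(errors, weights)))


# Hypothetical errors for two candidate prompts at three timesteps.
toy_errors = np.array([
    [0.9, 0.5, 0.2],   # prompt 0
    [0.4, 0.6, 0.7],   # prompt 1
])

uniform_choice = classify(toy_errors)                # all timesteps count equally
first_step_choice = classify(toy_errors, [1, 0, 0])  # only the first timestep counts
```

With these made-up numbers, uniform weighting selects prompt 0, while putting all weight on the first timestep selects prompt 1. This illustrates why the choice of timestep weighting can change a diffusion classifier's prediction.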
