Jodi: Unification of Visual Generation and Understanding via Joint Modeling

Abstract

Visual generation and understanding are two deeply interconnected aspects of human intelligence, yet they have traditionally been treated as separate tasks in machine learning. In this paper, we propose Jodi, a diffusion framework that unifies visual generation and understanding by jointly modeling the image domain and multiple label domains. Specifically, Jodi is built upon a linear diffusion transformer along with a role switch mechanism, which enables it to perform three types of tasks: (1) joint generation, where the model simultaneously generates images and multiple labels; (2) controllable generation, where images are generated conditioned on any combination of labels; and (3) image perception, where multiple labels are predicted at once from a given image. Furthermore, we present the Joint-1.6M dataset, which contains 200,000 high-quality images collected from public sources, automatic labels for 7 visual domains, and LLM-generated captions. Extensive experiments demonstrate that Jodi excels in both generation and understanding tasks and exhibits strong extensibility to a wider range of visual domains. Code is available at https://github.com/VIPL-GENUN/Jodi.
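One way to read the three task types is as different role assignments over the same set of domains. The sketch below is illustrative only and is not taken from the Jodi codebase: the role names (GENERATE, CONDITION, IGNORE), the placeholder domain names, and the assign_roles helper are all assumptions made for this example. In particular, the abstract does not say how labels left out of a conditioning set are handled, so the IGNORE role is a guess.

```python
from enum import Enum


class Role(Enum):
    """Hypothetical roles a domain can take; names are assumptions."""
    GENERATE = "generate"    # domain is produced by the model
    CONDITION = "condition"  # domain is supplied as input
    IGNORE = "ignore"        # domain takes no part (assumed behavior)


IMAGE = "image"
# The abstract mentions 7 label domains but does not name them here,
# so placeholder names are used.
LABEL_DOMAINS = [f"label_{i}" for i in range(1, 8)]


def assign_roles(task, given_labels=()):
    """Map each domain to a role for one of the three task types."""
    given = set(given_labels)
    unknown = given - set(LABEL_DOMAINS)
    if unknown:
        raise ValueError(f"unknown label domains: {sorted(unknown)}")

    if task == "joint_generation":
        # Image and all labels are generated together.
        return {d: Role.GENERATE for d in [IMAGE, *LABEL_DOMAINS]}

    if task == "controllable_generation":
        # The image is generated from any non-empty combination of labels.
        if not given:
            raise ValueError("controllable generation needs at least one label")
        roles = {IMAGE: Role.GENERATE}
        for d in LABEL_DOMAINS:
            roles[d] = Role.CONDITION if d in given else Role.IGNORE
        return roles

    if task == "image_perception":
        # The image is given and all labels are predicted at once.
        roles = {IMAGE: Role.CONDITION}
        roles.update({d: Role.GENERATE for d in LABEL_DOMAINS})
        return roles

    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    for name, role in assign_roles(
        "controllable_generation", ["label_1", "label_3"]
    ).items():
        print(f"{name}: {role.value}")
```

Under these assumptions, switching between tasks changes only the role mapping and not the set of domains, which is the sense in which a single model could cover all three tasks.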
