Jodi: Unification of Visual Generation and Understanding via Joint Modeling

Abstract

Visual generation and understanding are two deeply interconnected aspects of human intelligence, yet they have traditionally been treated as separate tasks in machine learning. In this paper, we propose Jodi, a diffusion framework that unifies visual generation and understanding by jointly modeling the image domain and multiple label domains. Specifically, Jodi is built upon a linear diffusion transformer with a role switch mechanism, which enables it to perform three types of tasks: (1) joint generation, where the model simultaneously generates images and multiple labels; (2) controllable generation, where images are generated conditioned on any combination of labels; and (3) image perception, where multiple labels are predicted at once from a given image. Furthermore, we present the Joint-1.6M dataset, which contains 200,000 high-quality images collected from public sources, automatic labels for 7 visual domains, and LLM-generated captions. Extensive experiments demonstrate that Jodi excels in both generation and understanding tasks and exhibits strong extensibility to a wider range of visual domains. Code is available at https://github.com/VIPL-GENUN/Jodi.
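The three task types can be read as different role assignments over the same set of domains. The sketch below is only an illustration of that idea, not the authors' implementation: the role names, the `IGNORE` role for labels that are not supplied, the placeholder domain names `label_0` to `label_6`, and the `assign_roles` helper are all assumptions made here, since the abstract names neither the roles nor the 7 label domains.

```python
from enum import Enum


class Role(Enum):
    """Hypothetical roles a domain can take; the names are assumptions."""
    GENERATE = "generate"    # domain is produced by the model
    CONDITION = "condition"  # domain is supplied as a given input
    IGNORE = "ignore"        # domain is left out (assumed for this sketch)


IMAGE = "image"
# Placeholder names: the abstract does not name the 7 label domains.
LABEL_DOMAINS = [f"label_{i}" for i in range(7)]


def assign_roles(task, given_labels=()):
    """Map each domain to a role for one of the three task types."""
    given = set(given_labels)
    unknown = given - set(LABEL_DOMAINS)
    if unknown:
        raise ValueError(f"unknown label domains: {sorted(unknown)}")

    if task == "joint_generation":
        # Image and all labels are generated together.
        return {d: Role.GENERATE for d in [IMAGE, *LABEL_DOMAINS]}

    if task == "controllable_generation":
        # The image is generated from any subset of labels.
        roles = {IMAGE: Role.GENERATE}
        for d in LABEL_DOMAINS:
            roles[d] = Role.CONDITION if d in given else Role.IGNORE
        return roles

    if task == "image_perception":
        # The image is given and all labels are predicted at once.
        roles = {IMAGE: Role.CONDITION}
        roles.update({d: Role.GENERATE for d in LABEL_DOMAINS})
        return roles

    raise ValueError(f"unknown task: {task}")
```

For example, `assign_roles("controllable_generation", ["label_0", "label_3"])` marks the image as generated, the two supplied labels as conditions, and the remaining labels as ignored. Under this reading, switching tasks changes only the role map, while a single model handles every domain.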
