

Jodi: Unification of Visual Generation and Understanding via Joint Modeling

May 25, 2025
Authors: Yifeng Xu, Zhenliang He, Meina Kan, Shiguang Shan, Xilin Chen
cs.AI

Abstract

Visual generation and understanding are two deeply interconnected aspects of human intelligence, yet they have been traditionally treated as separate tasks in machine learning. In this paper, we propose Jodi, a diffusion framework that unifies visual generation and understanding by jointly modeling the image domain and multiple label domains. Specifically, Jodi is built upon a linear diffusion transformer along with a role switch mechanism, which enables it to perform three particular types of tasks: (1) joint generation, where the model simultaneously generates images and multiple labels; (2) controllable generation, where images are generated conditioned on any combination of labels; and (3) image perception, where multiple labels can be predicted at once from a given image. Furthermore, we present the Joint-1.6M dataset, which contains 200,000 high-quality images collected from public sources, automatic labels for 7 visual domains, and LLM-generated captions. Extensive experiments demonstrate that Jodi excels in both generation and understanding tasks and exhibits strong extensibility to a wider range of visual domains. Code is available at https://github.com/VIPL-GENUN/Jodi.
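The abstract describes a role switch mechanism that lets one model cover three task types by designating each domain (the image plus each label domain) as either a generation target or a condition. A minimal sketch of that idea, assuming hypothetical names (`assign_roles`, generic `label_0` … `label_6` domains) that are not the authors' actual API:

```python
# Hypothetical sketch of the role-switch idea from the Jodi abstract.
# All names here are illustrative assumptions, not the released code's API.
# Each domain is assigned a role: "generate" (denoised by the diffusion
# model) or "condition" (supplied clean as conditioning input).

from typing import Dict, Sequence

# One image domain plus 7 generic label domains (placeholders for the
# paper's 7 visual label domains).
DOMAINS = ["image"] + [f"label_{i}" for i in range(7)]

def assign_roles(task: str, condition_domains: Sequence[str] = ()) -> Dict[str, str]:
    """Map each domain to a role for the three task types in the abstract."""
    if task == "joint_generation":
        # (1) Image and all labels are generated simultaneously.
        return {d: "generate" for d in DOMAINS}
    if task == "controllable_generation":
        # (2) Any combination of labels conditions generation of the rest.
        return {d: ("condition" if d in condition_domains else "generate")
                for d in DOMAINS}
    if task == "image_perception":
        # (3) The image conditions prediction of all labels at once.
        return {d: ("condition" if d == "image" else "generate")
                for d in DOMAINS}
    raise ValueError(f"unknown task: {task}")

roles = assign_roles("controllable_generation", condition_domains=["label_0"])
print(roles["label_0"], roles["image"])  # condition generate
```

The point of the sketch is that a single set of per-domain role flags, rather than separate architectures, distinguishes joint generation, controllable generation, and perception.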

