Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
October 17, 2024
Authors: Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, Ping Luo
cs.AI
Abstract
In this paper, we introduce Janus, an autoregressive framework that unifies
multimodal understanding and generation. Prior research often relies on a
single visual encoder for both tasks, such as Chameleon. However, due to the
differing levels of information granularity required by multimodal
understanding and generation, this approach can lead to suboptimal performance,
particularly in multimodal understanding. To address this issue, we decouple
visual encoding into separate pathways, while still leveraging a single,
unified transformer architecture for processing. The decoupling not only
alleviates the conflict between the visual encoder's roles in understanding and
generation, but also enhances the framework's flexibility. For instance, both
the multimodal understanding and generation components can independently select
their most suitable encoding methods. Experiments show that Janus surpasses
previous unified models and matches or exceeds the performance of task-specific
models. The simplicity, high flexibility, and effectiveness of Janus make it a
strong candidate for next-generation unified multimodal models.
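The core idea of the abstract, separate task-specific visual encoders feeding one shared autoregressive transformer, can be sketched as follows. This is a toy illustration with hypothetical names, not the authors' actual implementation: the understanding pathway stands in for a continuous semantic encoder, the generation pathway for a discrete VQ-style tokenizer, and both project into the same embedding width consumed by a single backbone.

```python
# Toy sketch of decoupled visual encoding with a shared backbone.
# All names and shapes here are illustrative assumptions, not the paper's code.
import numpy as np

D_MODEL = 8  # shared embedding width of the unified transformer

def understanding_encoder(image: np.ndarray) -> np.ndarray:
    """Stand-in for a semantic vision encoder: produces continuous,
    coarse-grained features suited to multimodal understanding."""
    patches = image.reshape(-1, 4)           # fake patchify: 4 pixels per patch
    proj = np.full((4, D_MODEL), 0.1)        # toy projection ("adaptor")
    return patches @ proj                    # (num_patches, D_MODEL)

def generation_encoder(image: np.ndarray) -> np.ndarray:
    """Stand-in for a VQ tokenizer: maps the image to discrete codes and
    embeds them -- the fine-grained pathway suited to image generation."""
    codes = (image.reshape(-1) * 3).astype(int) % 16       # fake quantization
    codebook = np.arange(16 * D_MODEL, dtype=float).reshape(16, D_MODEL) / 100
    return codebook[codes]                   # (num_tokens, D_MODEL)

def unified_transformer(seq: np.ndarray) -> np.ndarray:
    """Stand-in for the single shared autoregressive transformer:
    both pathways feed sequences of the same width into it."""
    assert seq.shape[1] == D_MODEL
    return seq.mean(axis=0)                  # toy pooling instead of attention

image = np.random.rand(4, 4)
# Both task-specific encodings are valid inputs to the same backbone:
out_und = unified_transformer(understanding_encoder(image))
out_gen = unified_transformer(generation_encoder(image))
```

The point of the sketch is the interface: the two encoders can be chosen independently for each task, because the only contract they share with the backbone is the embedding width.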