Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

Abstract

In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior approaches, such as Chameleon, often rely on a single visual encoder for both tasks. However, multimodal understanding and generation require different levels of information granularity, so a shared encoder can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways while still using a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also makes the framework more flexible. For instance, the multimodal understanding and generation components can each independently select their most suitable encoding method. Experiments show that Janus surpasses previous unified models and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.
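The decoupling idea can be illustrated with a small toy sketch. The abstract does not specify the encoders or the transformer, so everything below is a hypothetical stand-in: `understanding_encoder`, `generation_encoder`, `shared_backbone`, `DecoupledModel`, and `EMBED_DIM` are illustrative names, not the authors' implementation. The sketch only shows the routing pattern: each task picks its own visual pathway, and both pathways feed one shared sequence model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Image = List[List[float]]      # toy grayscale image: rows of pixel values in [0, 1]
Sequence = List[List[float]]   # sequence of embedding vectors

EMBED_DIM = 4  # hypothetical shared embedding width


def understanding_encoder(image: Image) -> Sequence:
    """Hypothetical coarse pathway: one summary vector per image row."""
    return [[sum(row) / len(row)] * EMBED_DIM for row in image]


def generation_encoder(image: Image) -> Sequence:
    """Hypothetical fine-grained pathway: one quantised vector per pixel."""
    levels = 15  # toy codebook with 16 entries (assumption for illustration)
    return [[round(px * levels) / levels] * EMBED_DIM
            for row in image for px in row]


def shared_backbone(seq: Sequence) -> Sequence:
    """Stand-in for the unified transformer: a causal running mean."""
    out, running = [], [0.0] * EMBED_DIM
    for i, vec in enumerate(seq, start=1):
        running = [r + v for r, v in zip(running, vec)]
        out.append([r / i for r in running])
    return out


@dataclass
class DecoupledModel:
    # One visual encoder per task, but a single backbone shared by both.
    encoders: Dict[str, Callable[[Image], Sequence]]
    backbone: Callable[[Sequence], Sequence]

    def forward(self, task: str, image: Image, text: Sequence) -> Sequence:
        visual = self.encoders[task](image)   # task-specific pathway
        return self.backbone(text + visual)   # shared processing


model = DecoupledModel(
    encoders={
        "understanding": understanding_encoder,
        "generation": generation_encoder,
    },
    backbone=shared_backbone,
)

image = [[0.0, 1.0], [0.5, 0.5]]
text = [[1.0] * EMBED_DIM]
understood = model.forward("understanding", image, text)
generated = model.forward("generation", image, text)
```

In this sketch, swapping either encoder changes only one entry in `encoders`; the backbone and the other pathway are untouched, which is the flexibility the abstract describes.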
