TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models

Abstract

Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that builds a unified continuous visual representation by cascading a VAE encoder with a representation encoder. This unified representation space allows end-to-end processing of images and videos for both understanding and generation tasks. Compared to prior UMMs with decoupled representations, TUNA's unified visual space avoids the representation format mismatches introduced by separate encoders and outperforms decoupled alternatives in both understanding and generation. Moreover, we observe that stronger pretrained representation encoders consistently yield better performance across all multimodal tasks, highlighting the importance of the representation encoder. Finally, in this unified setting, jointly training on both understanding and generation data allows the two tasks to benefit from each other rather than interfere. Extensive experiments on multimodal understanding and generation benchmarks show that TUNA achieves state-of-the-art results in image and video understanding, image and video generation, and image editing, demonstrating the effectiveness and scalability of its unified representation design.
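To make the cascaded design concrete, the following is a minimal toy sketch of the data flow the abstract describes: visual input passes through a VAE encoder, the resulting latents pass through a representation encoder, and the same output tokens serve both understanding and generation. Everything in it is an assumption made for illustration. The random linear maps stand in for the real encoders, and the names `make_linear`, `unified_visual_tokens`, and the dimensions are hypothetical. The abstract does not specify TUNA's actual architecture, so this should not be read as the authors' implementation.

```python
import random

def make_linear(in_dim, out_dim, seed):
    """Return a fixed random linear map (a hypothetical stand-in for a trained encoder)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def apply(vec):
        # Plain matrix-vector product: one output value per weight row.
        return [sum(w * x for w, x in zip(row, vec)) for row in weights]

    return apply

# Hypothetical sizes, chosen only to keep the example small.
PATCH_DIM = 12   # flattened pixel values per image patch
LATENT_DIM = 4   # size of the VAE latent for each patch
TOKEN_DIM = 8    # size of each unified visual token

# Stage 1: stand-in for the VAE encoder (pixels -> continuous latents).
vae_encoder = make_linear(PATCH_DIM, LATENT_DIM, seed=0)
# Stage 2: stand-in for the representation encoder (latents -> tokens).
representation_encoder = make_linear(LATENT_DIM, TOKEN_DIM, seed=1)

def unified_visual_tokens(patches):
    """Cascade the two encoders: the second consumes the output of the first."""
    latents = [vae_encoder(patch) for patch in patches]
    return [representation_encoder(latent) for latent in latents]

# Five dummy patches standing in for an image (or one video frame).
patches = [[float(i + j) for j in range(PATCH_DIM)] for i in range(5)]
tokens = unified_visual_tokens(patches)

# In a unified setup, both task families read the same representation,
# rather than each task using the output of its own separate encoder.
understanding_input = tokens
generation_input = tokens
```

The point of the sketch is the last two assignments: a decoupled design would compute two different token sets from two unrelated encoders, whereas here both tasks share one representation produced by a single cascade.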
