Show-o2: Improved Native Unified Multimodal Models

Abstract

This paper presents Show-o2, a family of improved native unified multimodal models that leverage autoregressive modeling and flow matching. Unified visual representations are constructed in a 3D causal variational autoencoder space through a dual path of spatial(-temporal) fusion, which enables scalability across image and video modalities while ensuring effective multimodal understanding and generation. Building on a language model, autoregressive modeling and flow matching are natively applied to the language head and the flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to learn effectively and to scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.
