Show-o2: Improved Native Unified Multimodal Models
June 18, 2025
Authors: Jinheng Xie, Zhenheng Yang, Mike Zheng Shou
cs.AI
Abstract
This paper presents improved native unified multimodal models, i.e.,
Show-o2, that leverage autoregressive modeling and flow matching. Built upon a
3D causal variational autoencoder space, unified visual representations are
constructed through a dual-path of spatial (-temporal) fusion, enabling
scalability across image and video modalities while ensuring effective
multimodal understanding and generation. Based on a language model,
autoregressive modeling and flow matching are natively applied to the language
head and flow head, respectively, to facilitate text token prediction and
image/video generation. A two-stage training recipe is designed to effectively
learn and scale to larger models. The resulting Show-o2 models demonstrate
versatility in handling a wide range of multimodal understanding and generation
tasks across diverse modalities, including text, images, and videos. Code and
models are released at https://github.com/showlab/Show-o.
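To make the dual-head training described in the abstract more concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of jointly applying a next-token cross-entropy loss through a language head and a flow-matching (velocity regression) loss through a flow head on top of a shared backbone. All module names, dimensions, and the toy data below are assumptions for illustration only; the actual Show-o2 model operates in a 3D causal VAE latent space with dual-path spatial(-temporal) fusion and proper causal masking, which this sketch omits.

```python
# Illustrative sketch of a dual-head objective: autoregressive text prediction
# (language head) plus flow matching on continuous visual latents (flow head).
# Hypothetical names and sizes; not taken from the Show-o2 codebase.

import torch
import torch.nn as nn
import torch.nn.functional as F


class DualHeadSketch(nn.Module):
    def __init__(self, vocab_size=1000, hidden=256, latent_dim=16):
        super().__init__()
        # Stand-in for the language-model backbone over a mixed text/visual sequence.
        # (Causal masking is omitted here for brevity.)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.text_embed = nn.Embedding(vocab_size, hidden)
        self.visual_proj = nn.Linear(latent_dim, hidden)
        # Language head: predicts the next text token autoregressively.
        self.language_head = nn.Linear(hidden, vocab_size)
        # Flow head: predicts the velocity field for flow matching on visual latents.
        self.flow_head = nn.Linear(hidden, latent_dim)

    def forward(self, text_ids, noisy_latents):
        text_h = self.text_embed(text_ids)                # (B, T_text, hidden)
        vis_h = self.visual_proj(noisy_latents)           # (B, T_vis, hidden)
        h = self.backbone(torch.cat([text_h, vis_h], 1))  # unified sequence
        t = text_ids.shape[1]
        return self.language_head(h[:, :t]), self.flow_head(h[:, t:])


def training_step(model, text_ids, clean_latents):
    """One joint step: next-token loss on text + flow-matching loss on latents."""
    b = clean_latents.shape[0]
    # Flow matching: interpolate between noise x0 and data x1 at a random time t
    # and regress the constant velocity (x1 - x0) along the straight path.
    x0 = torch.randn_like(clean_latents)
    t = torch.rand(b, 1, 1)
    xt = (1 - t) * x0 + t * clean_latents
    target_velocity = clean_latents - x0

    text_logits, pred_velocity = model(text_ids, xt)

    # Autoregressive text loss: position i is trained to predict token i+1.
    ar_loss = F.cross_entropy(
        text_logits[:, :-1].reshape(-1, text_logits.shape[-1]),
        text_ids[:, 1:].reshape(-1),
    )
    flow_loss = F.mse_loss(pred_velocity, target_velocity)
    return ar_loss + flow_loss


if __name__ == "__main__":
    model = DualHeadSketch()
    text_ids = torch.randint(0, 1000, (2, 8))      # toy text tokens
    clean_latents = torch.randn(2, 12, 16)         # toy visual latents (e.g. from a VAE)
    loss = training_step(model, text_ids, clean_latents)
    loss.backward()
    print(f"joint loss: {loss.item():.4f}")
```

The point of the sketch is only the shape of the objective: one backbone feeds two heads, so text tokens and image/video latents are trained natively in the same sequence rather than through separate models.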