Show-o2: Improved Native Unified Multimodal Models
June 18, 2025
Authors: Jinheng Xie, Zhenheng Yang, Mike Zheng Shou
cs.AI
Abstract
This paper presents improved native unified multimodal models, i.e.,
Show-o2, that leverage autoregressive modeling and flow matching. Built upon a
3D causal variational autoencoder space, unified visual representations are
constructed through a dual-path of spatial (-temporal) fusion, enabling
scalability across image and video modalities while ensuring effective
multimodal understanding and generation. Based on a language model,
autoregressive modeling and flow matching are natively applied to the language
head and flow head, respectively, to facilitate text token prediction and
image/video generation. A two-stage training recipe is designed to effectively
learn and scale to larger models. The resulting Show-o2 models demonstrate
versatility in handling a wide range of multimodal understanding and generation
tasks across diverse modalities, including text, images, and videos. Code and
models are released at https://github.com/showlab/Show-o.
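To make the dual-head training described in the abstract more concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of jointly applying a next-token cross-entropy loss through a language head and a flow-matching (velocity regression) loss through a flow head on top of a shared backbone. All module names, dimensions, and the toy data below are assumptions for illustration only; the actual Show-o2 model operates in a 3D causal VAE latent space with dual-path spatial(-temporal) fusion and proper causal masking, which this sketch omits.

```python
# Illustrative sketch of a dual-head objective: autoregressive text prediction
# (language head) plus flow matching on continuous visual latents (flow head).
# Hypothetical names and sizes; not taken from the Show-o2 codebase.

import torch
import torch.nn as nn
import torch.nn.functional as F


class DualHeadSketch(nn.Module):
    def __init__(self, vocab_size=1000, hidden=256, latent_dim=16):
        super().__init__()
        # Stand-in for the language-model backbone over a mixed text/visual sequence.
        # (Causal masking is omitted here for brevity.)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.text_embed = nn.Embedding(vocab_size, hidden)
        self.visual_proj = nn.Linear(latent_dim, hidden)
        # Language head: predicts the next text token autoregressively.
        self.language_head = nn.Linear(hidden, vocab_size)
        # Flow head: predicts the velocity field for flow matching on visual latents.
        self.flow_head = nn.Linear(hidden, latent_dim)

    def forward(self, text_ids, noisy_latents):
        text_h = self.text_embed(text_ids)                # (B, T_text, hidden)
        vis_h = self.visual_proj(noisy_latents)           # (B, T_vis, hidden)
        h = self.backbone(torch.cat([text_h, vis_h], 1))  # unified sequence
        t = text_ids.shape[1]
        return self.language_head(h[:, :t]), self.flow_head(h[:, t:])


def training_step(model, text_ids, clean_latents):
    """One joint step: next-token loss on text + flow-matching loss on latents."""
    b = clean_latents.shape[0]
    # Flow matching: interpolate between noise x0 and data x1 at a random time t
    # and regress the constant velocity (x1 - x0) along the straight path.
    x0 = torch.randn_like(clean_latents)
    t = torch.rand(b, 1, 1)
    xt = (1 - t) * x0 + t * clean_latents
    target_velocity = clean_latents - x0

    text_logits, pred_velocity = model(text_ids, xt)

    # Autoregressive text loss: position i is trained to predict token i+1.
    ar_loss = F.cross_entropy(
        text_logits[:, :-1].reshape(-1, text_logits.shape[-1]),
        text_ids[:, 1:].reshape(-1),
    )
    flow_loss = F.mse_loss(pred_velocity, target_velocity)
    return ar_loss + flow_loss


if __name__ == "__main__":
    model = DualHeadSketch()
    text_ids = torch.randint(0, 1000, (2, 8))      # toy text tokens
    clean_latents = torch.randn(2, 12, 16)         # toy visual latents (e.g. from a VAE)
    loss = training_step(model, text_ids, clean_latents)
    loss.backward()
    print(f"joint loss: {loss.item():.4f}")
```

The point of the sketch is only the shape of the objective: one backbone feeds two heads, so text tokens and image/video latents are trained natively in the same sequence rather than through separate models.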