Show-o2: Improved Native Unified Multimodal Models
June 18, 2025
Authors: Jinheng Xie, Zhenheng Yang, Mike Zheng Shou
cs.AI
Abstract
This paper presents improved native unified multimodal models, i.e.,
Show-o2, that leverage autoregressive modeling and flow matching. Built upon a
3D causal variational autoencoder space, unified visual representations are
constructed through a dual-path of spatial (-temporal) fusion, enabling
scalability across image and video modalities while ensuring effective
multimodal understanding and generation. Based on a language model,
autoregressive modeling and flow matching are natively applied to the language
head and flow head, respectively, to facilitate text token prediction and
image/video generation. A two-stage training recipe is designed to effectively
learn and scale to larger models. The resulting Show-o2 models demonstrate
versatility in handling a wide range of multimodal understanding and generation
tasks across diverse modalities, including text, images, and videos. Code and
models are released at https://github.com/showlab/Show-o.
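The abstract describes a shared language-model backbone with two heads: a language head trained autoregressively on text tokens and a flow head trained with flow matching on visual latents from a 3D causal VAE. The sketch below illustrates how such a dual-head training step could be wired together; it is a minimal PyTorch sketch under stated assumptions, not the released Show-o2 implementation. The module names and sizes, the strictly causal attention mask over the whole sequence, and the linear-interpolation (rectified-flow-style) noising schedule are illustrative assumptions; the paper's actual attention pattern, schedule, and loss weighting may differ.

```python
# Hedged sketch of a unified AR + flow-matching training step.
# Assumptions (not from the paper): toy dimensions, a vanilla TransformerEncoder
# as the "language model", a single causal mask over text and visual tokens,
# and a linear interpolation path x_t = (1 - t) * noise + t * x_1 whose target
# velocity is x_1 - noise.

import torch
import torch.nn as nn
import torch.nn.functional as F


class DualHeadUnifiedModel(nn.Module):
    def __init__(self, vocab_size=32000, dim=512, latent_dim=16, depth=4, heads=8):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.visual_proj = nn.Linear(latent_dim, dim)   # lift VAE latent tokens to LM width
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)  # stand-in for the language model
        self.language_head = nn.Linear(dim, vocab_size)      # autoregressive text prediction
        self.flow_head = nn.Linear(dim, latent_dim)          # velocity prediction for flow matching

    def forward(self, text_ids, noisy_latents):
        # Concatenate text tokens and (noised) visual latent tokens into one sequence.
        seq = torch.cat([self.text_embed(text_ids), self.visual_proj(noisy_latents)], dim=1)
        # Simplification: strictly causal attention everywhere; a real unified model
        # may instead mix causal attention over text with full attention over visual tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(seq.shape[1])
        hidden = self.backbone(seq, mask=mask)
        n_text = text_ids.shape[1]
        text_logits = self.language_head(hidden[:, :n_text])
        velocity = self.flow_head(hidden[:, n_text:])
        return text_logits, velocity


def training_losses(model, text_ids, clean_latents):
    """One combined step: next-token loss on text + flow-matching loss on latents."""
    noise = torch.randn_like(clean_latents)
    t = torch.rand(clean_latents.shape[0], 1, 1)           # one timestep per sample
    noisy = (1.0 - t) * noise + t * clean_latents          # linear interpolation path
    target_velocity = clean_latents - noise                # velocity target for this path

    text_logits, pred_velocity = model(text_ids, noisy)

    # Autoregressive objective: predict token i+1 from positions <= i.
    ar_loss = F.cross_entropy(
        text_logits[:, :-1].reshape(-1, text_logits.shape[-1]),
        text_ids[:, 1:].reshape(-1),
    )
    fm_loss = F.mse_loss(pred_velocity, target_velocity)
    return ar_loss + fm_loss


if __name__ == "__main__":
    model = DualHeadUnifiedModel()
    text_ids = torch.randint(0, 32000, (2, 8))    # toy text token ids
    latents = torch.randn(2, 64, 16)              # toy visual latent tokens
    loss = training_losses(model, text_ids, latents)
    loss.backward()
    print(float(loss))
```

At inference time, under the same assumptions, the language head would be sampled token by token for understanding tasks, while the flow head would integrate the predicted velocity field from noise to a clean latent (e.g., with a simple Euler solver) before decoding through the 3D causal VAE for image or video generation.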