Show-o2: Improved Native Unified Multimodal Models
June 18, 2025
Authors: Jinheng Xie, Zhenheng Yang, Mike Zheng Shou
cs.AI
Abstract
This paper presents improved native unified multimodal models, i.e.,
Show-o2, that leverage autoregressive modeling and flow matching. Built upon a
3D causal variational autoencoder space, unified visual representations are
constructed through a dual-path of spatial (-temporal) fusion, enabling
scalability across image and video modalities while ensuring effective
multimodal understanding and generation. Based on a language model,
autoregressive modeling and flow matching are natively applied to the language
head and flow head, respectively, to facilitate text token prediction and
image/video generation. A two-stage training recipe is designed to effectively
learn and scale to larger models. The resulting Show-o2 models demonstrate
versatility in handling a wide range of multimodal understanding and generation
tasks across diverse modalities, including text, images, and videos. Code and
models are released at https://github.com/showlab/Show-o.
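The abstract describes a shared language-model backbone with two heads: a language head trained autoregressively on text tokens and a flow head trained with flow matching on visual latents from a 3D causal VAE. The sketch below illustrates how such a dual-head training step could be wired together; it is a minimal PyTorch sketch under stated assumptions, not the released Show-o2 implementation. The module names and sizes, the strictly causal attention mask over the whole sequence, and the linear-interpolation (rectified-flow-style) noising schedule are illustrative assumptions; the paper's actual attention pattern, schedule, and loss weighting may differ.

```python
# Hedged sketch of a unified AR + flow-matching training step.
# Assumptions (not from the paper): toy dimensions, a vanilla TransformerEncoder
# as the "language model", a single causal mask over text and visual tokens,
# and a linear interpolation path x_t = (1 - t) * noise + t * x_1 whose target
# velocity is x_1 - noise.

import torch
import torch.nn as nn
import torch.nn.functional as F


class DualHeadUnifiedModel(nn.Module):
    def __init__(self, vocab_size=32000, dim=512, latent_dim=16, depth=4, heads=8):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.visual_proj = nn.Linear(latent_dim, dim)   # lift VAE latent tokens to LM width
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)  # stand-in for the language model
        self.language_head = nn.Linear(dim, vocab_size)      # autoregressive text prediction
        self.flow_head = nn.Linear(dim, latent_dim)          # velocity prediction for flow matching

    def forward(self, text_ids, noisy_latents):
        # Concatenate text tokens and (noised) visual latent tokens into one sequence.
        seq = torch.cat([self.text_embed(text_ids), self.visual_proj(noisy_latents)], dim=1)
        # Simplification: strictly causal attention everywhere; a real unified model
        # may instead mix causal attention over text with full attention over visual tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(seq.shape[1])
        hidden = self.backbone(seq, mask=mask)
        n_text = text_ids.shape[1]
        text_logits = self.language_head(hidden[:, :n_text])
        velocity = self.flow_head(hidden[:, n_text:])
        return text_logits, velocity


def training_losses(model, text_ids, clean_latents):
    """One combined step: next-token loss on text + flow-matching loss on latents."""
    noise = torch.randn_like(clean_latents)
    t = torch.rand(clean_latents.shape[0], 1, 1)           # one timestep per sample
    noisy = (1.0 - t) * noise + t * clean_latents          # linear interpolation path
    target_velocity = clean_latents - noise                # velocity target for this path

    text_logits, pred_velocity = model(text_ids, noisy)

    # Autoregressive objective: predict token i+1 from positions <= i.
    ar_loss = F.cross_entropy(
        text_logits[:, :-1].reshape(-1, text_logits.shape[-1]),
        text_ids[:, 1:].reshape(-1),
    )
    fm_loss = F.mse_loss(pred_velocity, target_velocity)
    return ar_loss + fm_loss


if __name__ == "__main__":
    model = DualHeadUnifiedModel()
    text_ids = torch.randint(0, 32000, (2, 8))    # toy text token ids
    latents = torch.randn(2, 64, 16)              # toy visual latent tokens
    loss = training_losses(model, text_ids, latents)
    loss.backward()
    print(float(loss))
```

At inference time, under the same assumptions, the language head would be sampled token by token for understanding tasks, while the flow head would integrate the predicted velocity field from noise to a clean latent (e.g., with a simple Euler solver) before decoding through the 3D causal VAE for image or video generation.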