Show-o2: Improved Native Unified Multimodal Models

Abstract

This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.
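
As a rough illustration of how a next-token objective on a language head and a flow matching objective on a flow head might be combined, the following is a minimal sketch in plain Python. It is not the released Show-o2 code. The function names, the linear noise-to-latent interpolation path, and the weighting factor alpha are all assumptions made for this example; the abstract does not specify them.

```python
import math

def cross_entropy(logits, target_index):
    """Next-token loss for one position: -log softmax(logits)[target_index]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

def flow_matching_pair(noise, latent, t):
    """Assumed linear path from noise (t=0) to the clean latent (t=1).

    Returns the interpolated point x_t and the velocity target (latent - noise),
    which is constant along a straight-line path.
    """
    x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, latent)]
    velocity = [x - n for n, x in zip(noise, latent)]
    return x_t, velocity

def velocity_mse(predicted, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)

def combined_loss(logits, target_index, predicted_velocity, target_velocity, alpha=1.0):
    """Hypothetical joint objective: alpha * text loss + flow matching loss."""
    text_loss = cross_entropy(logits, target_index)
    flow_loss = velocity_mse(predicted_velocity, target_velocity)
    return alpha * text_loss + flow_loss

# Toy usage: a two-dimensional "latent" and a two-token vocabulary.
x_t, v_target = flow_matching_pair(noise=[0.0, 2.0], latent=[1.0, 4.0], t=0.5)
loss = combined_loss([0.0, 0.0], 0, [0.0, 0.0], v_target, alpha=0.5)
```

In a real model the logits would come from the language head and the predicted velocity from the flow head evaluated at x_t and t; here they are hard-coded only to keep the sketch self-contained.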
