Show-o2: Improved Native Unified Multimodal Models

Abstract

This paper presents Show-o2, a family of improved native unified multimodal models that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual path of spatial(-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed for effective learning and scaling to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.
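To make the pairing of a language head and a flow head concrete, here is a minimal, illustrative sketch of how a next-token loss and a flow matching loss could be combined into one objective. It is not taken from the Show-o repository. The function names, the linear noise-to-data path (noise at t = 0, data at t = 1), and the weight `alpha` are assumptions chosen for illustration, and the paper's actual objective and weighting may differ.

```python
import math


def cross_entropy(logits, target):
    """Next-token loss at one position: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def flow_matching_pair(x0, x1, t):
    """Assumed linear path: x_t = (1 - t) * x0 + t * x1.

    x0 is a noise sample, x1 is a clean visual latent.
    The regression target for the flow head is the velocity x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target


def mse(pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def combined_loss(text_logits, text_targets, v_pred, v_target, alpha=1.0):
    """Hypothetical joint objective: next-token loss + alpha * flow loss."""
    ntp = sum(
        cross_entropy(logits, y) for logits, y in zip(text_logits, text_targets)
    ) / len(text_targets)
    fm = mse(v_pred, v_target)
    return ntp + alpha * fm, ntp, fm


if __name__ == "__main__":
    # Toy values only: a 2-dimensional "latent" and one text position.
    x_t, v_target = flow_matching_pair(x0=[0.0, 0.0], x1=[2.0, 4.0], t=0.5)
    total, ntp, fm = combined_loss(
        text_logits=[[0.0, 0.0, 0.0, 0.0]],
        text_targets=[2],
        v_pred=[1.0, 3.0],  # stand-in for a flow head output at (x_t, t)
        v_target=v_target,
    )
    print(x_t, v_target, ntp, fm, total)
```

In a real model, `text_logits` would come from the language head and `v_pred` from the flow head evaluated on the interpolated latent `x_t` at time `t`; here both are hard-coded stand-ins.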
