Context Unrolling in Omni Models

Abstract

We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, in which the model explicitly reasons across multiple modal representations before producing predictions. This process lets the model aggregate complementary information across heterogeneous modalities, facilitating a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity. As a result, Omni achieves strong performance on both multimodal generation and understanding benchmarks, while demonstrating advanced multimodal reasoning capabilities, including in-context generation of text, images, videos, and 3D geometry.
