Context Unrolling in Omni Models

Abstract

We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, in which the model explicitly reasons across multiple modal representations before producing predictions. This process lets the model aggregate complementary information across heterogeneous modalities, yielding a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity. As a result, Omni achieves strong performance on both multimodal generation and understanding benchmarks, while demonstrating advanced multimodal reasoning capabilities, including in-context generation of text, images, video, and 3D geometry.
