Context Unrolling in Omni Models

April 23, 2026
Authors: Ceyuan Yang, Zhijie Lin, Yang Zhao, Fei Xiao, Hao He, Qi Zhao, Chaorui Deng, Kunchang Li, Zihan Ding, Yuwei Guo, Fuyun Wang, Fangqi Zhu, Xiaonan Nie, Shenhan Zhu, Shanchuan Lin, Hongsheng Li, Weilin Huang, Guang Shi, Haoqi Fan
cs.AI

Abstract

We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, where the model explicitly reasons across multiple modal representations before producing predictions. This process enables the model to aggregate complementary information across heterogeneous modalities, facilitating a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity. As a result, Omni achieves strong performance on both multimodal generation and understanding benchmarks, while demonstrating advanced multimodal reasoning capabilities, including in-context generation of text, image, video, and 3D geometry.
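
The abstract gives no implementation details, so the following is only a conceptual toy sketch (in PyTorch) of one way a "context unrolling" step could be structured: before emitting its final prediction, the model appends a per-modality seed token and autoregressively generates a short block of intermediate latents for each modality, then conditions the answer on the full unrolled sequence. Every name here (ToyOmni, modality_seeds, unroll_len) is hypothetical and not taken from the paper; this is an analogue of the described idea, not the authors' method.

```python
# Purely illustrative toy analogue of "Context Unrolling"; not the paper's code.
import torch
import torch.nn as nn

class ToyOmni(nn.Module):
    def __init__(self, d_model=64, n_modalities=3, unroll_len=4, vocab=100):
        super().__init__()
        self.unroll_len = unroll_len
        # one learned seed per modality that starts its unrolled latent block
        self.modality_seeds = nn.Parameter(torch.randn(n_modalities, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, context):
        # context: (B, T, d_model) embedded mixed-modality tokens
        B = context.size(0)
        seq = context
        # "Unroll" the context: for each modality, append its seed and let the
        # backbone generate `unroll_len` intermediate latents, which are fed
        # back into the sequence before any prediction is made.
        for seed in self.modality_seeds:
            block = seed.expand(B, 1, -1)
            for _ in range(self.unroll_len):
                hidden = self.backbone(torch.cat([seq, block], dim=1))
                block = torch.cat([block, hidden[:, -1:, :]], dim=1)
            seq = torch.cat([seq, block], dim=1)
        # the final prediction conditions on the original context plus the
        # unrolled per-modality latents
        return self.head(self.backbone(seq)[:, -1, :])

model = ToyOmni()
ctx = torch.randn(2, 10, 64)   # stand-in for embedded text/image/video tokens
logits = model(ctx)            # shape: (2, 100)
print(logits.shape)
```

The design point this sketch tries to mirror is the abstract's claim that the model aggregates complementary information across heterogeneous modalities: the prediction head never sees the raw context alone, only the context extended with explicitly generated intermediate representations, one block per modality.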