Context Unrolling in Omni Models

Abstract

We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, in which the model explicitly reasons across multiple modal representations before producing predictions. This process lets the model aggregate complementary information across heterogeneous modalities, yielding a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity. As a result, Omni achieves strong performance on both multimodal generation and understanding benchmarks, while demonstrating advanced multimodal reasoning capabilities, including in-context generation of text, images, video, and 3D geometry.
