OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
October 17, 2025
Authors: Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, Yuanhang Su, Sean Lin, An-Chieh Cheng, Zhen Wan, Jinchuan Tian, Yuming Lou, Dong Yang, Zhijian Liu, Yukang Chen, Ambrish Dantrey, Ehsan Jahangiri, Sreyan Ghosh, Daguang Xu, Ehsan Hosseini-Asl, Danial Mohseni Taheri, Vidya Murali, Sifei Liu, Jason Lu, Oluwatobi Olabiyi, Frank Wang, Rafael Valle, Bryan Catanzaro, Andrew Tao, Song Han, Jan Kautz, Hongxu Yin, Pavlo Molchanov
cs.AI
Abstract
Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens, a 6x reduction compared to Qwen2.5-Omni's 1.2T. Finally, we demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factories.
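The abstract states that OmniAlignNet strengthens alignment between vision and audio embeddings in a shared omni-modal latent space, but does not specify the training objective. The sketch below is a minimal, hypothetical illustration of that idea, assuming a CLIP-style symmetric contrastive loss over paired vision/audio clip embeddings; the module name `OmniAlignSketch` and all dimensions are assumptions for illustration, not the paper's actual API.

```python
# Hypothetical sketch (not the paper's implementation): align pooled vision and
# audio embeddings in a shared latent space with a symmetric contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class OmniAlignSketch(nn.Module):
    """Project per-clip vision and audio embeddings into a shared space and
    pull matched pairs together with a CLIP-style InfoNCE objective."""

    def __init__(self, vision_dim: int, audio_dim: int, shared_dim: int = 512):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, shared_dim)
        self.audio_proj = nn.Linear(audio_dim, shared_dim)
        # Learnable temperature (initialized to log(1/0.07), as in CLIP).
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, vision_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
        # vision_emb: (B, vision_dim), audio_emb: (B, audio_dim),
        # one pooled embedding per clip; row i of each tensor is a matched pair.
        v = F.normalize(self.vision_proj(vision_emb), dim=-1)
        a = F.normalize(self.audio_proj(audio_emb), dim=-1)
        logits = self.logit_scale.exp() * v @ a.t()          # (B, B) similarities
        targets = torch.arange(v.size(0), device=v.device)   # matches on the diagonal
        # Symmetric cross-entropy: vision-to-audio and audio-to-vision.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))
```

Under this assumed formulation, minimizing the loss pushes each clip's vision and audio embeddings toward each other in the shared latent space while separating them from other clips in the batch; the paper's actual alignment design may differ in architecture and objective.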