OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
October 17, 2025
作者: Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, Yuanhang Su, Sean Lin, An-Chieh Cheng, Zhen Wan, Jinchuan Tian, Yuming Lou, Dong Yang, Zhijian Liu, Yukang Chen, Ambrish Dantrey, Ehsan Jahangiri, Sreyan Ghosh, Daguang Xu, Ehsan Hosseini-Asl, Danial Mohseni Taheri, Vidya Murali, Sifei Liu, Jason Lu, Oluwatobi Olabiyi, Frank Wang, Rafael Valle, Bryan Catanzaro, Andrew Tao, Song Han, Jan Kautz, Hongxu Yin, Pavlo Molchanov
cs.AI
Abstract
Advancing machine intelligence requires developing the ability to perceive
across multiple modalities, much as humans sense the world. We introduce
OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We
carefully study the design choices across model architecture and data curation.
For model architecture, we present three key innovations: (i) OmniAlignNet for
strengthening alignment between vision and audio embeddings in a shared
omni-modal latent space; (ii) Temporal Embedding Grouping for capturing
relative temporal alignment between vision and audio signals; and (iii)
Constrained Rotary Time Embedding for encoding absolute temporal information in
omni-modal embeddings. We introduce a curation and synthesis pipeline that
generates 24M single-modal and omni-modal conversations. We find that
modalities reinforce one another in both perception and reasoning. Our model,
OmniVinci, outperforms Qwen2.5-Omni by +19.05 on DailyOmni (cross-modal
understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while
using just 0.2T training tokens, a 6x reduction compared to Qwen2.5-Omni's
1.2T. Finally, we demonstrate omni-modal advantages in downstream applications
spanning robotics, medical AI, and smart factories.
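
To make the vision-audio alignment idea concrete, the snippet below is a minimal, hypothetical sketch of a contrastive alignment head that projects vision and audio embeddings into a shared latent space, in the spirit of what the abstract describes for OmniAlignNet. The module name, projection dimensions, and InfoNCE-style loss are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of vision-audio alignment in a shared latent space.
# All names and hyperparameters here are assumptions for illustration only;
# they do not reproduce the paper's OmniAlignNet architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class OmniAlignSketch(nn.Module):
    """Project vision and audio embeddings into one latent space and
    align paired samples with a symmetric contrastive (InfoNCE-style) loss."""

    def __init__(self, vision_dim: int, audio_dim: int, shared_dim: int = 512):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, shared_dim)
        self.audio_proj = nn.Linear(audio_dim, shared_dim)
        self.log_temp = nn.Parameter(torch.zeros(()))  # learnable temperature

    def forward(self, vision_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
        # vision_emb: (B, vision_dim), audio_emb: (B, audio_dim), paired by index.
        v = F.normalize(self.vision_proj(vision_emb), dim=-1)
        a = F.normalize(self.audio_proj(audio_emb), dim=-1)
        logits = v @ a.t() * self.log_temp.exp()          # (B, B) similarity matrix
        targets = torch.arange(v.size(0), device=v.device)
        # Symmetric loss: match each video to its audio and vice versa.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    model = OmniAlignSketch(vision_dim=1024, audio_dim=768)
    loss = model(torch.randn(8, 1024), torch.randn(8, 768))
    print(loss.item())
```

A design like this treats temporally co-occurring vision and audio clips as positive pairs; the abstract's Temporal Embedding Grouping and Constrained Rotary Time Embedding would additionally encode relative and absolute timing, which this sketch omits.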