OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention
February 5, 2026
Authors: Zhangquan Chen, Jiale Tao, Ruihuang Li, Yihao Hu, Ruitao Chen, Zhantao Yang, Xinlei Yu, Haodong Jing, Manyuan Zhang, Shuai Shao, Biao Wang, Qinglin Lu, Ruqi Huang
cs.AI
Abstract
While humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings, existing omnivideo models still face substantial challenges in audio-visual understanding tasks. In this paper, we propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. OmniVideo-R1 empowers models to "think with omnimodal cues" through two key strategies: (1) query-intention grounding, based on self-supervised learning paradigms; and (2) modality-attentive fusion, built upon contrastive learning paradigms. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.
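To make the second strategy concrete, below is a minimal, hypothetical sketch of what "modality-attentive fusion built upon contrastive learning" could look like in PyTorch. It is not the authors' implementation: the module name `ModalityAttentiveFusion`, the gating scheme, all dimensions, and the symmetric InfoNCE-style loss are illustrative assumptions based only on the abstract's wording.

```python
# Illustrative sketch only (NOT the paper's released code): a query-conditioned
# fusion of audio and visual features with per-modality attention weights,
# trained with a symmetric contrastive (InfoNCE-style) objective.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityAttentiveFusion(nn.Module):
    """Fuse audio and visual token features via cross-attention, then weight
    each modality's context by a learned gate (all design choices assumed)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(dim, 1)  # scalar relevance score per modality

    def forward(self, query: torch.Tensor, audio: torch.Tensor, visual: torch.Tensor):
        # query: (B, 1, D) pooled query embedding
        # audio: (B, Ta, D), visual: (B, Tv, D) per-modality token features
        a_ctx, _ = self.cross_attn(query, audio, audio)    # (B, 1, D)
        v_ctx, _ = self.cross_attn(query, visual, visual)  # (B, 1, D)
        # Softmax over the two modality gates -> modality attention weights
        gates = torch.cat([self.gate(a_ctx), self.gate(v_ctx)], dim=1)  # (B, 2, 1)
        w = gates.softmax(dim=1)
        fused = w[:, 0:1] * a_ctx + w[:, 1:2] * v_ctx      # (B, 1, D)
        return fused.squeeze(1)                            # (B, D)


def contrastive_loss(fused: torch.Tensor, target: torch.Tensor, tau: float = 0.07):
    """Symmetric InfoNCE between fused audio-visual embeddings and their
    matching query/answer embeddings within a batch (assumed objective)."""
    f = F.normalize(fused, dim=-1)
    t = F.normalize(target, dim=-1)
    logits = f @ t.t() / tau                     # (B, B) similarity matrix
    labels = torch.arange(f.size(0), device=f.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))


if __name__ == "__main__":
    B, Ta, Tv, D = 4, 20, 16, 512
    fusion = ModalityAttentiveFusion(dim=D)
    fused = fusion(torch.randn(B, 1, D), torch.randn(B, Ta, D), torch.randn(B, Tv, D))
    print(contrastive_loss(fused, torch.randn(B, D)).item())
```

The gated two-way softmax here is just one plausible way to realize "modality attention"; the actual mechanism in OmniVideo-R1, and how it interacts with the reinforcement framework, is specified in the paper itself rather than the abstract.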