OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention
February 5, 2026
Authors: Zhangquan Chen, Jiale Tao, Ruihuang Li, Yihao Hu, Ruitao Chen, Zhantao Yang, Xinlei Yu, Haodong Jing, Manyuan Zhang, Shuai Shao, Biao Wang, Qinglin Lu, Ruqi Huang
cs.AI
Abstract
While humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings, existing omnivideo models still face substantial challenges in audio-visual understanding tasks. In this paper, we propose OmniVideo-R1, a novel reinforcement-learning framework that improves mixed-modality reasoning. OmniVideo-R1 empowers models to "think with omnimodal cues" through two key strategies: (1) query-intensive grounding based on a self-supervised learning paradigm; and (2) modality-attentive fusion built upon a contrastive learning paradigm. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.
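The abstract does not detail how "modality-attentive fusion built upon contrastive learning" works; the sketch below is only a generic illustration of those two named ingredients, not the paper's actual method. It weights audio and visual embeddings by query-conditioned attention, then scores the fused vector against candidates with an InfoNCE-style contrastive objective. All function names, dimensions, and the temperature value are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modality_attentive_fusion(query, audio, visual):
    """Weight each modality embedding by its attention score w.r.t. the query.
    (Hypothetical stand-in for the paper's modality-attentive fusion.)"""
    mods = np.stack([audio, visual])               # (2, d)
    scores = mods @ query / np.sqrt(len(query))    # scaled dot-product scores, (2,)
    weights = softmax(scores)                      # attention over modalities
    return weights @ mods, weights                 # fused embedding (d,), weights (2,)

def info_nce(fused, candidates, temperature=0.1):
    """InfoNCE-style contrastive loss; the positive is candidates[0]."""
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    sims = unit(candidates) @ unit(fused) / temperature  # cosine similarities, (n,)
    return -np.log(softmax(sims)[0])                     # -log p(positive)

d = 8
query  = rng.standard_normal(d)
audio  = rng.standard_normal(d)
visual = rng.standard_normal(d)

fused, w = modality_attentive_fusion(query, audio, visual)
# Treat the fused vector itself as its own positive, plus one random negative.
loss = info_nce(fused, np.stack([fused, rng.standard_normal(d)]))
```

In this toy setup the attention weights sum to one and the contrastive loss is non-negative, shrinking as the positive pair's similarity dominates the negatives.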