OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention
February 5, 2026
Authors: Zhangquan Chen, Jiale Tao, Ruihuang Li, Yihao Hu, Ruitao Chen, Zhantao Yang, Xinlei Yu, Haodong Jing, Manyuan Zhang, Shuai Shao, Biao Wang, Qinglin Lu, Ruqi Huang
cs.AI
Abstract
While humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings, existing omnivideo models still face substantial challenges in audio-visual understanding tasks. In this paper, we propose OmniVideo-R1, a novel reinforcement-learning framework that improves mixed-modality reasoning. OmniVideo-R1 empowers models to "think with omnimodal cues" through two key strategies: (1) query-intensive grounding based on a self-supervised learning paradigm; and (2) modality-attentive fusion built upon a contrastive learning paradigm. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.
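The abstract does not detail how "modality-attentive fusion built upon contrastive learning" works; the sketch below is only a generic illustration of those two named ingredients, not the paper's actual method. It weights audio and visual embeddings by query-conditioned attention, then scores the fused vector against candidates with an InfoNCE-style contrastive objective. All function names, dimensions, and the temperature value are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modality_attentive_fusion(query, audio, visual):
    """Weight each modality embedding by its attention score w.r.t. the query.
    (Hypothetical stand-in for the paper's modality-attentive fusion.)"""
    mods = np.stack([audio, visual])               # (2, d)
    scores = mods @ query / np.sqrt(len(query))    # scaled dot-product scores, (2,)
    weights = softmax(scores)                      # attention over modalities
    return weights @ mods, weights                 # fused embedding (d,), weights (2,)

def info_nce(fused, candidates, temperature=0.1):
    """InfoNCE-style contrastive loss; the positive is candidates[0]."""
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    sims = unit(candidates) @ unit(fused) / temperature  # cosine similarities, (n,)
    return -np.log(softmax(sims)[0])                     # -log p(positive)

d = 8
query  = rng.standard_normal(d)
audio  = rng.standard_normal(d)
visual = rng.standard_normal(d)

fused, w = modality_attentive_fusion(query, audio, visual)
# Treat the fused vector itself as its own positive, plus one random negative.
loss = info_nce(fused, np.stack([fused, rng.standard_normal(d)]))
```

In this toy setup the attention weights sum to one and the contrastive loss is non-negative, shrinking as the positive pair's similarity dominates the negatives.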