OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention
February 5, 2026
Authors: Zhangquan Chen, Jiale Tao, Ruihuang Li, Yihao Hu, Ruitao Chen, Zhantao Yang, Xinlei Yu, Haodong Jing, Manyuan Zhang, Shuai Shao, Biao Wang, Qinglin Lu, Ruqi Huang
cs.AI
Abstract
While humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings, existing omnivideo models still face substantial challenges in audio-visual understanding tasks. In this paper, we propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. OmniVideo-R1 empowers models to "think with omnimodal cues" through two key strategies: (1) query-intention grounding, based on self-supervised learning paradigms; and (2) modality-attentive fusion, built upon contrastive learning paradigms. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.
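To make the second strategy concrete, below is a minimal, hypothetical sketch of what "modality-attentive fusion built upon contrastive learning" could look like in PyTorch. It is not the authors' implementation: the module name `ModalityAttentiveFusion`, the gating scheme, all dimensions, and the symmetric InfoNCE-style loss are illustrative assumptions based only on the abstract's wording.

```python
# Illustrative sketch only (NOT the paper's released code): a query-conditioned
# fusion of audio and visual features with per-modality attention weights,
# trained with a symmetric contrastive (InfoNCE-style) objective.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityAttentiveFusion(nn.Module):
    """Fuse audio and visual token features via cross-attention, then weight
    each modality's context by a learned gate (all design choices assumed)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(dim, 1)  # scalar relevance score per modality

    def forward(self, query: torch.Tensor, audio: torch.Tensor, visual: torch.Tensor):
        # query: (B, 1, D) pooled query embedding
        # audio: (B, Ta, D), visual: (B, Tv, D) per-modality token features
        a_ctx, _ = self.cross_attn(query, audio, audio)    # (B, 1, D)
        v_ctx, _ = self.cross_attn(query, visual, visual)  # (B, 1, D)
        # Softmax over the two modality gates -> modality attention weights
        gates = torch.cat([self.gate(a_ctx), self.gate(v_ctx)], dim=1)  # (B, 2, 1)
        w = gates.softmax(dim=1)
        fused = w[:, 0:1] * a_ctx + w[:, 1:2] * v_ctx      # (B, 1, D)
        return fused.squeeze(1)                            # (B, D)


def contrastive_loss(fused: torch.Tensor, target: torch.Tensor, tau: float = 0.07):
    """Symmetric InfoNCE between fused audio-visual embeddings and their
    matching query/answer embeddings within a batch (assumed objective)."""
    f = F.normalize(fused, dim=-1)
    t = F.normalize(target, dim=-1)
    logits = f @ t.t() / tau                     # (B, B) similarity matrix
    labels = torch.arange(f.size(0), device=f.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))


if __name__ == "__main__":
    B, Ta, Tv, D = 4, 20, 16, 512
    fusion = ModalityAttentiveFusion(dim=D)
    fused = fusion(torch.randn(B, 1, D), torch.randn(B, Ta, D), torch.randn(B, Tv, D))
    print(contrastive_loss(fused, torch.randn(B, D)).item())
```

The gated two-way softmax here is just one plausible way to realize "modality attention"; the actual mechanism in OmniVideo-R1, and how it interacts with the reinforcement framework, is specified in the paper itself rather than the abstract.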