OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention

Abstract

While humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings, existing omnivideo models still face substantial challenges on audio-visual understanding tasks. In this paper, we propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. OmniVideo-R1 empowers models to "think with omnimodal cues" through two key strategies: (1) query-intensive grounding based on self-supervised learning paradigms; and (2) modality-attentive fusion built upon contrastive learning paradigms. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.
