Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration

Abstract

Long-horizon video-audio reasoning and fine-grained pixel understanding impose conflicting requirements on omnimodal models: dense temporal coverage demands many low-resolution frames, whereas precise grounding calls for high-resolution inputs. We tackle this trade-off with a two-system architecture: a Global Reasoning System selects informative keyframes and rewrites the task at low spatial cost, while a Detail Understanding System performs pixel-level grounding on the selected high-resolution snippets. Because "optimal" keyframe selection and task reformulation are ambiguous and hard to supervise, we formulate them as a reinforcement learning (RL) problem and present Omni-R1, an end-to-end RL framework built on Group Relative Policy Optimization (GRPO). Omni-R1 trains the Global Reasoning System through hierarchical rewards obtained via online collaboration with the Detail Understanding System, requiring only one epoch of RL on small task splits. Experiments on two challenging benchmarks, Referring Audio-Visual Segmentation (RefAVS) and Reasoning Video Object Segmentation (REVOS), show that Omni-R1 not only surpasses strong supervised baselines but also outperforms specialized state-of-the-art models, while substantially improving out-of-domain generalization and mitigating multimodal hallucination. Our results demonstrate the first successful application of RL to large-scale omnimodal reasoning and highlight a scalable path toward general-purpose foundation models.
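As a rough illustration of the group-relative idea behind Group Relative Policy Optimization, the sketch below normalizes the reward of each sampled response against the other responses drawn for the same prompt. The function name, the epsilon term, and the reward values are assumptions made for illustration only; the abstract does not specify how the hierarchical rewards are composed, so each reward is treated here as a single scalar.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Return one advantage per sampled response in a group.

    Each reward is centered on the group mean and scaled by the group's
    (population) standard deviation. `eps` is an assumed stabilizer that
    avoids division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scalar rewards for four responses sampled for one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and those below it a negative one, so no separate learned value function is needed to provide the baseline.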