Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration

May 26, 2025
Authors: Hao Zhong, Muzhi Zhu, Zongze Du, Zheng Huang, Canyu Zhao, Mingyu Liu, Wen Wang, Hao Chen, Chunhua Shen
cs.AI

Abstract

Long-horizon video-audio reasoning and fine-grained pixel understanding impose conflicting requirements on omnimodal models: dense temporal coverage demands many low-resolution frames, whereas precise grounding calls for high-resolution inputs. We tackle this trade-off with a two-system architecture: a Global Reasoning System selects informative keyframes and rewrites the task at low spatial cost, while a Detail Understanding System performs pixel-level grounding on the selected high-resolution snippets. Because "optimal" keyframe selection and task reformulation are ambiguous and hard to supervise, we formulate them as a reinforcement learning (RL) problem and present Omni-R1, an end-to-end RL framework built on Group Relative Policy Optimization. Omni-R1 trains the Global Reasoning System through hierarchical rewards obtained via online collaboration with the Detail Understanding System, requiring only one epoch of RL on small task splits. Experiments on two challenging benchmarks, namely Referring Audio-Visual Segmentation (RefAVS) and Reasoning Video Object Segmentation (REVOS), show that Omni-R1 not only surpasses strong supervised baselines but also outperforms specialized state-of-the-art models, while substantially improving out-of-domain generalization and mitigating multimodal hallucination. Our results demonstrate the first successful application of RL to large-scale omnimodal reasoning and highlight a scalable path toward universal foundation models.
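
The abstract names Group Relative Policy Optimization but gives no training details. As a rough illustration only, the sketch below shows the group-normalized advantage commonly associated with that method: each sampled response is scored relative to the mean and standard deviation of its own group. The function name, the `eps` term, the use of the population standard deviation, and the reward values are all assumptions made for this example; none of them are taken from Omni-R1.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Hypothetical helper: subtracts the group mean and divides by the
    group (population) standard deviation; eps avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four responses sampled for the same prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's average get a positive advantage and those below get a negative one, so no separately learned value function is needed to supply a baseline.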
