

Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning

January 14, 2026
Authors: Dongjie Cheng, Yongqi Li, Zhixin Ma, Hongru Cai, Yupeng Hu, Wenjie Wang, Liqiang Nie, Wenjie Li
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) are making significant progress in multimodal reasoning. Early approaches focus on pure text-based reasoning. More recent studies have incorporated multimodal information into the reasoning steps; however, they often follow a single task-specific reasoning pattern, which limits their generalizability across various multimodal tasks. In fact, there are numerous multimodal tasks requiring diverse reasoning skills, such as zooming in on a specific region or marking an object within an image. To address this, we propose unified generative multimodal reasoning, which unifies diverse multimodal reasoning skills by generating intermediate images during the reasoning process. We instantiate this paradigm with Omni-R1, a two-stage SFT+RL framework featuring perception alignment loss and perception reward, thereby enabling functional image generation. Additionally, we introduce Omni-R1-Zero, which eliminates the need for multimodal annotations by bootstrapping step-wise visualizations from text-only reasoning data. Empirical results show that Omni-R1 achieves unified generative reasoning across a wide range of multimodal tasks, and Omni-R1-Zero can match or even surpass Omni-R1 on average, suggesting a promising direction for generative multimodal reasoning.
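To make the described two-stage recipe more concrete, below is a minimal, hypothetical sketch of how an SFT objective with a perception alignment term and an RL reward with a perception reward component could be wired up. Everything here is an illustrative assumption: the function names, the cosine-similarity alignment term, and the weights `alpha` and `beta` are placeholders, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' implementation) of the two training stages
# described in the abstract: SFT with a perception alignment term, then RL
# with a perception reward. All names and weightings are hypothetical.
import torch.nn.functional as F


def sft_loss(text_logits, text_targets, gen_image_feats, ref_image_feats, alpha=0.5):
    """Stage 1 (SFT): next-token cross-entropy plus a perception alignment
    term that pulls features of the generated intermediate image toward
    reference visual features."""
    ce = F.cross_entropy(
        text_logits.view(-1, text_logits.size(-1)),
        text_targets.view(-1),
    )
    # Cosine distance between generated and reference intermediate-image
    # features; a plausible alignment choice, not necessarily the paper's.
    align = 1.0 - F.cosine_similarity(gen_image_feats, ref_image_feats, dim=-1).mean()
    return ce + alpha * align


def rl_reward(answer_correct, perception_score, beta=0.3):
    """Stage 2 (RL): task-correctness reward plus a perception reward that
    scores whether the generated intermediate image is functionally useful
    (e.g., zooms into or marks the relevant region)."""
    return float(answer_correct) + beta * perception_score
```

In this reading, the perception terms are what push the model toward functional image generation: the SFT alignment term shapes what the intermediate images look like, and the RL perception reward scores whether they actually help the downstream answer.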