OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

May 27, 2026
Authors: Xinchen Zhang, Bowei Liu, Jiale Liu, Chufan Shi, Yizhen Zhang, Junhong Liu, Youliang Zhang, Zhiheng Li, Yujiu Yang, Ling Yang
cs.AI

Abstract

Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation models. In this work, we investigate multimodal meta-verification, which leverages verifier-generated rationales rather than decision-only signals, and explore how to effectively incorporate meta-verification feedback into multimodal verifier training. We identify two key findings. First, symbolic verifier outputs (e.g., bounding boxes) outperform textual explanations as meta-verification rationales, enabling efficient rule-based reinforcement learning rewards while avoiding reliance on model-based rewards from auxiliary judge models. Second, decoupling reinforcement learning objectives for binary judgment and meta-verification substantially outperforms joint reward optimization, due to intrinsic differences in output structure and learning dynamics. Based on these insights, we train OmniVerifier-M1, a generalist visual verifier leveraging symbolic meta-verification and decoupled reinforcement learning. OmniVerifier-M1 provides robust verification and fine-grained error localization, and further enables M1-TTS, a verifier-driven agentic generation system achieving dynamic region-level self-correction. This approach paves the way for more reliable, interpretable, and fine-grained multimodal verification, supporting safer and more controllable foundation model deployment.
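
To make the first finding concrete, the following is a minimal sketch of what a rule-based reward over symbolic rationales could look like, assuming the verifier emits axis-aligned bounding boxes for the regions it flags as erroneous. The abstract does not specify the reward actually used, so the box format, the IoU threshold of 0.5, the greedy matching, the F1 scoring, and every function name here are illustrative assumptions rather than the authors' method.

```python
from typing import List, Tuple

# Hypothetical box format (x1, y1, x2, y2); not taken from the paper.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def judgment_reward(pred_ok: bool, true_ok: bool) -> float:
    """Binary-judgment reward: 1 if the verdict matches the label, else 0."""
    return 1.0 if pred_ok == true_ok else 0.0


def localization_reward(pred: List[Box], gold: List[Box],
                        thresh: float = 0.5) -> float:
    """Assumed meta-verification reward: F1 of greedy one-to-one box matches."""
    if not pred and not gold:
        return 1.0  # nothing to localize and nothing predicted
    if not pred or not gold:
        return 0.0
    unmatched = list(gold)
    hits = 0
    for p in pred:
        # Pick the best remaining gold box for this prediction, if any.
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thresh:
            unmatched.remove(best)
            hits += 1
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Both rewards are computed by fixed rules, with no auxiliary judge model in the loop. Keeping them as two separate functions, rather than returning one summed score, loosely mirrors the decoupling described in the second finding, although how the two objectives are actually scheduled during training is not stated in the abstract.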