OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration
May 27, 2026
Authors: Xinchen Zhang, Bowei Liu, Jiale Liu, Chufan Shi, Yizhen Zhang, Junhong Liu, Youliang Zhang, Zhiheng Li, Yujiu Yang, Ling Yang
cs.AI
Abstract
Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation models. In this work, we investigate multimodal meta-verification, which leverages verifier-generated rationales rather than decision-only signals, and explore how to effectively incorporate meta-verification feedback into multimodal verifier training. We identify two key findings. First, symbolic verifier outputs (e.g., bounding boxes) outperform textual explanations as meta-verification rationales, enabling efficient rule-based reinforcement learning rewards while avoiding reliance on model-based rewards from auxiliary judge models. Second, decoupling reinforcement learning objectives for binary judgment and meta-verification substantially outperforms joint reward optimization, due to intrinsic differences in output structure and learning dynamics. Based on these insights, we train OmniVerifier-M1, a generalist visual verifier leveraging symbolic meta-verification and decoupled reinforcement learning. OmniVerifier-M1 provides robust verification and fine-grained error localization, and further enables M1-TTS, a verifier-driven agentic generation system achieving dynamic region-level self-correction. This approach paves the way for more reliable, interpretable, and fine-grained multimodal verification, supporting safer and more controllable foundation model deployment.
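To make the idea of a rule-based reward over symbolic rationales concrete, here is a minimal sketch in Python. It is an illustration only: the abstract does not specify the reward rule, so the use of intersection-over-union (IoU) with a 0.5 threshold, the (x1, y1, x2, y2) box format, and the function names `iou`, `judgment_reward`, and `localization_reward` are all assumptions. The two rewards are kept as separate functions to mirror the decoupling of binary judgment from meta-verification; how the paper actually schedules or combines them is not stated here.

```python
from typing import Tuple

# Hypothetical box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def judgment_reward(pred_correct: bool, gold_correct: bool) -> float:
    """Reward for the binary verdict: 1.0 on an exact match, else 0.0."""
    return 1.0 if pred_correct == gold_correct else 0.0


def localization_reward(pred: Box, gold: Box, threshold: float = 0.5) -> float:
    """Assumed rule: reward 1.0 when the predicted error region overlaps
    the annotated region with IoU at or above the threshold."""
    return 1.0 if iou(pred, gold) >= threshold else 0.0
```

Because both rewards are computed by fixed rules from the verifier's own output, no auxiliary judge model is needed to score a rationale. A joint objective would add the two values into one scalar, whereas a decoupled setup would feed each into its own optimization step.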