Generative Universal Verifier as Multimodal Meta-Reasoner
October 15, 2025
Authors: Xinchen Zhang, Xiaoying Zhang, Youbin Wu, Yanbin Cao, Renrui Zhang, Ruihang Chu, Ling Yang, Yujiu Yang
cs.AI
Abstract
We introduce Generative Universal Verifier, a novel concept and plugin
designed for next-generation multimodal reasoning in vision-language models and
unified multimodal models, providing the fundamental capability of reflection
and refinement on visual outcomes during the reasoning and generation process.
This work makes three main contributions: (1) We build ViVerBench, a
comprehensive benchmark spanning 16 categories of critical tasks for evaluating
visual outcomes in multimodal reasoning. Results show that existing VLMs
consistently underperform across these tasks, underscoring a substantial gap
from human-level capability in reliable visual verification. (2) We design two
automated pipelines to construct large-scale visual verification data and train
OmniVerifier-7B, the first omni-capable generative verifier trained for
universal visual verification, which achieves notable gains on ViVerBench (+8.3).
Through training, we identify three atomic capabilities in visual verification
and demonstrate how they generalize and interact synergistically. (3) We
propose OmniVerifier-TTS, a sequential test-time scaling paradigm that
leverages the universal verifier to bridge image generation and editing within
unified models, enhancing the upper bound of generative ability through
iterative fine-grained optimization. Beyond generation, we extend the universal
verifier to broader world-modeling interleaved reasoning scenarios.
Empirically, OmniVerifier-TTS achieves improvements on T2I-ReasonBench (+3.7)
and GenEval++ (+4.3), outperforming existing parallel test-time scaling methods,
such as Best-of-N. By endowing multimodal reasoning with reliable visual
verification, OmniVerifier advances both reliable reflection during generation
and scalable test-time refinement, marking a step toward more trustworthy and
controllable next-generation reasoning systems.
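
The contrast between sequential and parallel test-time scaling can be made concrete with a minimal sketch. The abstract does not specify the actual OmniVerifier-TTS procedure, so everything below is an assumption for illustration: the function names (`sequential_tts`, `best_of_n`), the callables `generate`, `verify`, `edit`, and `score`, the round budget, and the toy stand-ins that represent an "image" as a list of rendered object names. The sequential loop refines one sample using verifier feedback, whereas Best-of-N draws independent samples and keeps the highest-scoring one.

```python
from typing import Callable, List, Tuple

# Toy representation (an assumption): an "image" is the list of objects it shows.
Image = List[str]


def sequential_tts(
    prompt: str,
    generate: Callable[[str], Image],
    verify: Callable[[str, Image], Tuple[bool, str]],
    edit: Callable[[Image, str], Image],
    max_rounds: int = 4,
) -> Image:
    """Generate once, then alternate verification and editing.

    Stops when the verifier accepts the image or the round budget is spent.
    """
    image = generate(prompt)
    for _ in range(max_rounds):
        ok, feedback = verify(prompt, image)
        if ok:
            break
        # Apply a targeted fix to the current image based on the feedback.
        image = edit(image, feedback)
    return image


def best_of_n(
    prompt: str,
    generate: Callable[[str], Image],
    score: Callable[[str, Image], float],
    n: int = 4,
) -> Image:
    """Parallel baseline: sample n candidates independently, keep the best."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda img: score(prompt, img))


# --- Hypothetical toy stand-ins, not real models ---

def toy_generate(prompt: str) -> Image:
    # Pretend the generator only renders the first requested object.
    return prompt.split(", ")[:1]


def toy_verify(prompt: str, image: Image) -> Tuple[bool, str]:
    # Report the first requested object that is missing from the image.
    missing = [obj for obj in prompt.split(", ") if obj not in image]
    return (not missing, missing[0] if missing else "")


def toy_edit(image: Image, feedback: str) -> Image:
    # Pretend the editor adds exactly the object named in the feedback.
    return image + [feedback]


def toy_score(prompt: str, image: Image) -> float:
    # Count how many requested objects appear in the image.
    return float(sum(obj in image for obj in prompt.split(", ")))


PROMPT = "red cube, blue ball, green cone"
refined = sequential_tts(PROMPT, toy_generate, toy_verify, toy_edit)
sampled = best_of_n(PROMPT, toy_generate, toy_score)
```

In this toy setting the generator repeats the same mistake on every draw, so extra parallel samples cannot help, while the sequential loop repairs one missing object per round. Real generators are stochastic, so the gap would depend on how often independent samples happen to satisfy the prompt.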