Generative Universal Verifier as Multimodal Meta-Reasoner
October 15, 2025
Authors: Xinchen Zhang, Xiaoying Zhang, Youbin Wu, Yanbin Cao, Renrui Zhang, Ruihang Chu, Ling Yang, Yujiu Yang
cs.AI
Abstract
We introduce the Generative Universal Verifier, a novel concept and plugin
designed for next-generation multimodal reasoning in vision-language models and
unified multimodal models, providing the fundamental capability of reflection
and refinement on visual outcomes during the reasoning and generation process.
This work makes three main contributions: (1) We build ViVerBench, a
comprehensive benchmark spanning 16 categories of critical tasks for evaluating
visual outcomes in multimodal reasoning. Results show that existing VLMs
consistently underperform across these tasks, underscoring a substantial gap
from human-level capability in reliable visual verification. (2) We design two
automated pipelines to construct large-scale visual verification data and train
OmniVerifier-7B, the first omni-capable generative verifier trained for
universal visual verification, which achieves notable gains on ViVerBench (+8.3).
Through training, we identify three atomic capabilities in visual verification
and demonstrate how they generalize and interact synergistically. (3) We
propose OmniVerifier-TTS, a sequential test-time scaling paradigm that
leverages the universal verifier to bridge image generation and editing within
unified models, enhancing the upper bound of generative ability through
iterative fine-grained optimization. Beyond generation, we extend the universal
verifier to broader world-modeling interleaved reasoning scenarios.
Empirically, OmniVerifier-TTS achieves improvements on T2I-ReasonBench (+3.7)
and GenEval++ (+4.3), outperforming existing parallel test-time scaling methods,
such as Best-of-N. By endowing multimodal reasoning with reliable visual
verification, OmniVerifier advances both reliable reflection during generation
and scalable test-time refinement, marking a step toward more trustworthy and
controllable next-generation reasoning systems.
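
To make the contrast between sequential and parallel test-time scaling concrete, here is a minimal Python sketch. It is not the paper's implementation: the `Verdict` type, the function names, and the toy generator, verifier, and editor are all hypothetical stand-ins, and an "image" is modeled as a set of object names so the loop can run without any model. The sequential loop generates once, asks a verifier for feedback, and applies an edit per round. Best-of-N samples N candidates independently and keeps the highest-scoring one.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple

Image = FrozenSet[str]   # toy stand-in: an "image" is the set of objects it shows
Prompt = FrozenSet[str]  # toy stand-in: a prompt is the set of required objects


@dataclass
class Verdict:
    ok: bool        # True when the image is judged to satisfy the prompt
    feedback: str   # hypothetical edit instruction when it does not


def sequential_refine(
    prompt: Prompt,
    generate: Callable[[Prompt], Image],
    verify: Callable[[Prompt, Image], Verdict],
    edit: Callable[[Image, str], Image],
    max_rounds: int = 4,
) -> Tuple[Image, List[Image]]:
    """Generate once, then alternate verification and editing."""
    image = generate(prompt)
    history = [image]
    for _ in range(max_rounds):
        verdict = verify(prompt, image)
        if verdict.ok:
            break
        # Each round reuses the previous image and fixes one reported issue.
        image = edit(image, verdict.feedback)
        history.append(image)
    return image, history


def best_of_n(
    prompt: Prompt,
    generate: Callable[[Prompt], Image],
    score: Callable[[Prompt, Image], float],
    n: int = 4,
) -> Image:
    """Parallel baseline: sample n independent candidates, keep the best."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda img: score(prompt, img))


# --- Toy components (assumptions for illustration only) ---
def toy_generate(prompt: Prompt) -> Image:
    # A deliberately weak generator: draws only the first required object.
    return frozenset(sorted(prompt)[:1])


def toy_verify(prompt: Prompt, image: Image) -> Verdict:
    # Reports the alphabetically first missing object as feedback.
    missing = sorted(prompt - image)
    return Verdict(ok=not missing, feedback=missing[0] if missing else "")


def toy_edit(image: Image, feedback: str) -> Image:
    # Adds the object named in the feedback.
    return image | {feedback}


def toy_score(prompt: Prompt, image: Image) -> float:
    # Counts how many required objects are present.
    return float(len(prompt & image))


prompt = frozenset({"cat", "dog", "tree"})
final_image, history = sequential_refine(prompt, toy_generate, toy_verify, toy_edit)
best_image = best_of_n(prompt, toy_generate, toy_score, n=4)
```

In this toy setting the sequential loop reaches an image containing all three objects after two edits, while Best-of-N cannot do better than any single sample from the same generator. That is the intuition behind raising the upper bound of generation through iterative refinement, although real models are stochastic and the actual gains are those reported on the benchmarks above.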