Generative Universal Verifier as Multimodal Meta-Reasoner

Abstract

We introduce the Generative Universal Verifier, a novel concept and plugin designed for next-generation multimodal reasoning in vision-language models (VLMs) and unified multimodal models. It provides the fundamental capability of reflecting on and refining visual outcomes during the reasoning and generation process. This work makes three main contributions: (1) We build ViVerBench, a comprehensive benchmark spanning 16 categories of critical tasks for evaluating visual outcomes in multimodal reasoning. Results show that existing VLMs consistently underperform across these tasks, underscoring a substantial gap from human-level capability in reliable visual verification. (2) We design two automated pipelines to construct large-scale visual verification data and train OmniVerifier-7B, the first omni-capable generative verifier trained for universal visual verification, which achieves notable gains on ViVerBench (+8.3). Through training, we identify three atomic capabilities in visual verification and demonstrate how they generalize and interact synergistically. (3) We propose OmniVerifier-TTS, a sequential test-time scaling paradigm that leverages the universal verifier to bridge image generation and editing within unified models, raising the upper bound of generative ability through iterative fine-grained optimization. Beyond generation, we extend the universal verifier to broader world-modeling interleaved reasoning scenarios. Empirically, OmniVerifier-TTS achieves improvements on T2I-ReasonBench (+3.7) and GenEval++ (+4.3), outperforming existing parallel test-time scaling methods such as Best-of-N. By endowing multimodal reasoning with reliable visual verification, OmniVerifier advances both reliable reflection during generation and scalable test-time refinement, marking a step toward more trustworthy and controllable next-generation reasoning systems.
