OmniVerifier-M1: A Multimodal Meta-Verifier with Explicit Structural Recalibration

Abstract

Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation models. In this work, we investigate multimodal meta-verification, which leverages verifier-generated rationales rather than decision-only signals, and explore how to effectively incorporate meta-verification feedback into multimodal verifier training. We identify two key findings. First, symbolic verifier outputs (e.g., bounding boxes) outperform textual explanations as meta-verification rationales, enabling efficient rule-based reinforcement learning rewards while avoiding reliance on model-based rewards from auxiliary judge models. Second, decoupling reinforcement learning objectives for binary judgment and meta-verification substantially outperforms joint reward optimization, due to intrinsic differences in output structure and learning dynamics. Based on these insights, we train OmniVerifier-M1, a generalist visual verifier leveraging symbolic meta-verification and decoupled reinforcement learning. OmniVerifier-M1 provides robust verification and fine-grained error localization, and further enables M1-TTS, a verifier-driven agentic generation system achieving dynamic region-level self-correction. This approach paves the way for more reliable, interpretable, and fine-grained multimodal verification, supporting safer and more controllable foundation model deployment.
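The two findings above can be illustrated with a minimal sketch of rule-based rewards computed directly from symbolic outputs. The abstract does not give the reward formulas, so everything here is an assumption made for illustration: the `(x1, y1, x2, y2)` box format, the use of intersection-over-union (IoU) as the localization score, and the names `judgment_reward`, `localization_reward`, and `score` are hypothetical and are not taken from the OmniVerifier-M1 implementation. The sketch only shows why a bounding box permits a reward without an auxiliary judge model, and how the two signals can be kept separate rather than summed into one scalar.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical box format (assumption): (x1, y1, x2, y2) with x1 <= x2, y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def judgment_reward(pred_label: bool, ref_label: bool) -> float:
    """Rule-based reward for the binary judgment: exact match only."""
    return 1.0 if pred_label == ref_label else 0.0


def localization_reward(pred_box: Optional[Box], ref_box: Optional[Box]) -> float:
    """Rule-based reward for the symbolic rationale (assumed here to be IoU)."""
    if ref_box is None:
        # No error region in the reference: reward only an empty prediction.
        return 1.0 if pred_box is None else 0.0
    if pred_box is None:
        return 0.0
    return iou(pred_box, ref_box)


@dataclass
class Rewards:
    """The two signals are returned separately instead of being summed."""
    judgment: float
    localization: float


def score(pred_label: bool, pred_box: Optional[Box],
          ref_label: bool, ref_box: Optional[Box]) -> Rewards:
    return Rewards(
        judgment=judgment_reward(pred_label, ref_label),
        localization=localization_reward(pred_box, ref_box),
    )
```

Returning a `Rewards` pair rather than a single number leaves it to the training loop to optimize each objective on its own, which is one simple reading of "decoupled"; the actual decoupling scheme used for OmniVerifier-M1 may differ.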