V-Zero: Fine-Grained Visual Reasoning Without Answer Labels via On-Policy Distillation and Contrastive Evidence Gating

Abstract

Fine-grained visual reasoning requires multimodal large language models (MLLMs) to identify task-relevant visual evidence and ground their reasoning in local image regions. Existing agentic methods typically rely on reinforcement learning with verifiable rewards or supervised fine-tuning on large-scale annotated reasoning traces, leading to costly exploration, hand-designed verification rules, or heavy dependence on textual supervision. A natural way to avoid such external answer labels is to learn from trajectories sampled by the student itself, which points to On-Policy Distillation (OPD). To understand what OPD can and cannot provide for visual reasoning, we revisit it as negative-free stop-gradient alignment. This perspective shows that, although OPD provides effective token-level correction, its ceiling is constrained by the absence of trajectory-level discrimination. Motivated by these observations, we propose V-Zero, an answer-label-free framework for visual reasoning with contrastive evidence gating. V-Zero uses no annotated textual answer labels; instead, during training it pairs a question-relevant regional crop with a negative visual view to evaluate student-sampled trajectories and gate dense token-level distillation. Experiments on multiple visual reasoning benchmarks show that V-Zero consistently improves fine-grained visual reasoning while preserving strong generalization. Notably, V-Zero is more than 5× faster than previous supervised fine-tuning methods and more than 10× faster than reinforcement learning baselines. The code and dataset will be released at https://github.com/eVI-group-SCU/V-Zero.
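
The abstract does not spell out the gating rule or the distillation objective, so the following is only a minimal sketch of one possible reading, not the authors' implementation. It assumes that a teacher pass conditioned on the question-relevant crop supplies the target logits, that a second pass conditioned on the negative view supplies contrast logits, that a trajectory passes the gate when its mean log-probability is higher under the crop than under the negative view, and that the token-level term is a per-token reverse KL (a common choice in on-policy distillation). All function names, the `margin` parameter, and the hard 0/1 gate are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def sequence_logprob(logits_per_step: Sequence[Sequence[float]],
                     tokens: Sequence[int]) -> float:
    """Mean log-probability of the sampled tokens under the given logits."""
    total = sum(log_softmax(step)[tok]
                for step, tok in zip(logits_per_step, tokens))
    return total / len(tokens)


def evidence_gate(pos_logits, neg_logits, tokens, margin: float = 0.0) -> float:
    """Trajectory-level gate (assumed rule): 1.0 if the student-sampled
    trajectory is better supported by the crop view than by the negative
    view by more than `margin`, else 0.0."""
    gap = sequence_logprob(pos_logits, tokens) - sequence_logprob(neg_logits, tokens)
    return 1.0 if gap > margin else 0.0


def token_reverse_kl(student_logits, teacher_logits) -> float:
    """KL(student || teacher) at one position. In an autodiff framework the
    teacher logits would be treated as constants (stop-gradient)."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


def gated_distillation_loss(student_logits, pos_logits, neg_logits,
                            tokens, margin: float = 0.0) -> float:
    """Dense token-level distillation toward the crop-conditioned teacher,
    switched on or off per trajectory by the contrastive gate."""
    gate = evidence_gate(pos_logits, neg_logits, tokens, margin)
    per_token = [token_reverse_kl(s, t)
                 for s, t in zip(student_logits, pos_logits)]
    return gate * sum(per_token) / len(per_token)
```

In this sketch the gate supplies the trajectory-level discrimination that plain OPD lacks, while the reverse-KL term keeps the dense token-level correction. A soft gate (for example a sigmoid of the log-probability gap) would be an equally plausible variant.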