V-Zero: Answer-Label-Free On-Policy Distillation with Contrastive Evidence Gating for Fine-Grained Visual Reasoning

Abstract

Fine-grained visual reasoning requires multimodal large language models (MLLMs) to identify task-relevant visual evidence and ground their reasoning in local image regions. Existing agentic methods typically rely on reinforcement learning with verifiable rewards or supervised fine-tuning on large-scale annotated reasoning traces, leading to costly exploration, hand-designed verification rules, or heavy dependence on textual supervision. A natural way to avoid such external answer labels is to learn from trajectories sampled by the student itself, which points to On-Policy Distillation (OPD). To understand what OPD can and cannot provide for visual reasoning, we revisit it as negative-free stop-gradient alignment. This perspective shows that, although OPD provides effective token-level correction, its ceiling is constrained by the absence of trajectory-level discrimination. Motivated by these observations, we propose V-Zero, an answer-label-free framework for visual reasoning with contrastive evidence gating. V-Zero uses no annotated textual answer labels; instead, during training it pairs a question-relevant regional crop with a negative visual view to evaluate student-sampled trajectories and gate dense token-level distillation. Experiments on multiple visual reasoning benchmarks show that V-Zero consistently improves fine-grained visual reasoning while preserving strong generalization. Notably, V-Zero is more than 5 times faster than previous supervised fine-tuning methods and more than 10 times faster than reinforcement learning baselines. Code and dataset will be released at https://github.com/eVI-group-SCU/V-Zero.
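
To make the idea of gating dense token-level distillation with a trajectory-level contrast more concrete, the following Python sketch may help. It is not the V-Zero implementation: the abstract does not specify the gating rule, the divergence, or how trajectories are scored, so every choice here is an assumption made for illustration. In particular, the reverse-KL token loss, the hard threshold `margin`, the scoring of a trajectory by the difference in teacher log-likelihood between the positive crop and the negative view, and all function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def sequence_logprob(step_logits, tokens):
    """Sum of log-probabilities assigned to the sampled tokens, one per step."""
    return sum(log_softmax(l)[t] for l, t in zip(step_logits, tokens))

def token_reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one position (assumed divergence, not from the paper)."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(si) * (si - ti) for si, ti in zip(s, t))

def gated_distillation_loss(student_steps, teacher_pos_steps, teacher_neg_steps,
                            tokens, margin=0.0):
    """Hypothetical gate: keep the token-level loss only when the student-sampled
    trajectory is better supported by the positive crop than by the negative view."""
    evidence = (sequence_logprob(teacher_pos_steps, tokens)
                - sequence_logprob(teacher_neg_steps, tokens))
    gate = 1.0 if evidence > margin else 0.0
    kls = [token_reverse_kl(s, t) for s, t in zip(student_steps, teacher_pos_steps)]
    return gate * sum(kls) / len(kls), gate

# Toy example: a 2-token trajectory over a 2-word vocabulary.
tokens = [0, 1]
student = [[0.0, 0.0], [0.0, 0.0]]   # student is uniform at both steps
with_crop = [[2.0, 0.0], [0.0, 2.0]] # teacher given the relevant crop favors the sampled tokens
with_neg = [[0.0, 0.0], [0.0, 0.0]]  # teacher given the negative view is uninformative

loss, gate = gated_distillation_loss(student, with_crop, with_neg, tokens)
```

In this toy setting the gate opens because the sampled tokens are more likely under the crop-conditioned distribution than under the negative view, so the dense token-level term contributes to the loss. Swapping the two views closes the gate and the trajectory contributes nothing, which is one simple way a trajectory-level signal could complement token-level correction.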