V-Zero: Answer-Label-Free On-Policy Distillation with Contrastive Evidence Gating for Fine-Grained Visual Reasoning

Abstract

Fine-grained visual reasoning requires multimodal large language models (MLLMs) to identify task-relevant visual evidence and ground their reasoning in local image regions. Existing agentic methods typically rely on reinforcement learning with verifiable rewards or on supervised fine-tuning over large-scale annotated reasoning traces, which leads to costly exploration, hand-designed verification rules, or heavy dependence on textual supervision. A natural way to avoid such external answer labels is to learn from trajectories sampled by the student itself, which points to On-Policy Distillation (OPD). To understand what OPD can and cannot provide for visual reasoning, we revisit it as negative-free stop-gradient alignment. This perspective shows that, although OPD provides effective token-level correction, its performance ceiling is constrained by the absence of trajectory-level discrimination. Motivated by this observation, we propose V-Zero, an answer-label-free framework for visual reasoning with contrastive evidence gating. V-Zero uses no annotated textual answer labels; instead, during training it pairs a question-relevant regional crop with a negative visual view to evaluate student-sampled trajectories and gate dense token-level distillation. Experiments on multiple visual reasoning benchmarks show that V-Zero consistently improves fine-grained visual reasoning while preserving strong generalization. Notably, V-Zero trains more than 5× faster than previous supervised fine-tuning methods and more than 10× faster than reinforcement learning baselines. Code and dataset will be released at https://github.com/eVI-group-SCU/V-Zero.
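
The abstract does not specify how the gate or the distillation loss is computed, so the following is only a minimal sketch of one way the idea could look, not the V-Zero implementation. It assumes the token-level signal is a reverse KL between student and teacher distributions on student-sampled tokens, and that the trajectory-level gate compares a trajectory's log-likelihood given the question-relevant crop against its log-likelihood given the negative view. The function names, the hard 0/1 gate, and the `margin` parameter are all hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position.

    Assumption: in a real training loop the teacher side would be treated
    as a constant (stop-gradient); here everything is plain floats.
    """
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(si) * (si - ti) for si, ti in zip(s, t))


def evidence_gate(logp_with_crop, logp_with_negative, margin=0.0):
    """Hypothetical trajectory-level gate.

    Returns 1.0 when the sampled trajectory is better supported by the
    question-relevant crop than by the negative view, and 0.0 otherwise.
    """
    return 1.0 if (logp_with_crop - logp_with_negative) > margin else 0.0


def gated_distillation_loss(student_logits, teacher_logits,
                            logp_with_crop, logp_with_negative):
    """Mean per-token reverse KL for one trajectory, scaled by the gate."""
    per_token = [token_reverse_kl(s, t)
                 for s, t in zip(student_logits, teacher_logits)]
    gate = evidence_gate(logp_with_crop, logp_with_negative)
    return gate * sum(per_token) / len(per_token)


if __name__ == "__main__":
    # Two token positions over a toy vocabulary of size 2.
    student = [[2.0, 0.0], [0.0, 1.0]]
    teacher = [[0.0, 2.0], [0.0, 1.0]]
    # Gate open: the trajectory is more likely given the relevant crop.
    print(gated_distillation_loss(student, teacher, -1.0, -3.0))
    # Gate closed: the negative view explains the trajectory at least as well.
    print(gated_distillation_loss(student, teacher, -3.0, -1.0))
```

In this sketch, a trajectory whose likelihood does not improve with the relevant crop contributes no token-level distillation signal, which is one simple way to add trajectory-level discrimination on top of dense token-level correction.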