CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization
November 24, 2025
Authors: Xinhai Hou, Shaoyuan Xu, Manan Biyani, Mayan Li, Jia Liu, Todd C. Hollon, Bryan Wang
cs.AI
Abstract
Agentic vision-language models are increasingly trained to "think with images" by calling image operations. However, we show that high final-answer accuracy often hides unfaithful visual reasoning: models may invoke tools on irrelevant regions or ignore tool outputs entirely, yet still guess the correct answer. In this work, we first propose a faithfulness evaluation protocol that measures whether intermediate visual tool outputs (e.g., crops) actually contain the queried evidence. This protocol reveals that recent visual agents achieve high final-answer accuracy but exhibit low rates of faithful tool use on visual search benchmarks.
We then introduce CodeV, a code-based visual agent trained with Tool-Aware Policy Optimization (TAPO). TAPO is a process-level RL framework that augments GRPO with dense rewards defined directly on visual tool inputs and outputs, rather than on chain-of-thought tokens, which makes supervision easier to verify and less susceptible to reward hacking. CodeV represents visual tools as executable Python code, and TAPO assigns step-wise rewards based solely on the question and the tool output, encouraging tool use that is both necessary and consistent with the evidence.
In a two-stage SFT+RL pipeline, CodeV achieves competitive or superior accuracy while substantially increasing faithful tool-use rates on related visual search benchmarks. Beyond visual search, CodeV attains strong performance on a range of multimodal reasoning and math benchmarks, suggesting that explicitly supervising intermediate tool behavior is crucial for building trustworthy, agentic visual reasoning systems.
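To make the gap between final-answer accuracy and faithful tool use concrete, the following is a minimal illustrative sketch, not the paper's protocol. The abstract does not say how "contains the queried evidence" is decided, so this sketch assumes each example comes with an annotated evidence bounding box and counts a trajectory as faithful when at least one crop covers a chosen fraction of that box. The names `coverage`, `Trajectory`, `is_faithful`, `summarize`, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def coverage(crop: Box, evidence: Box) -> float:
    """Fraction of the evidence box's area that lies inside the crop."""
    ix_min = max(crop[0], evidence[0])
    iy_min = max(crop[1], evidence[1])
    ix_max = min(crop[2], evidence[2])
    iy_max = min(crop[3], evidence[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area = (evidence[2] - evidence[0]) * (evidence[3] - evidence[1])
    return inter / area if area > 0 else 0.0


@dataclass
class Trajectory:
    crops: List[Box]       # regions the agent cropped via tool calls
    evidence: Box          # assumed ground-truth evidence region
    answer_correct: bool   # whether the final answer was correct


def is_faithful(traj: Trajectory, threshold: float = 0.5) -> bool:
    """Hypothetical criterion: some crop covers >= threshold of the evidence."""
    return any(coverage(c, traj.evidence) >= threshold for c in traj.crops)


def summarize(trajs: List[Trajectory], threshold: float = 0.5) -> Dict[str, float]:
    """Report accuracy next to the faithful tool-use rate."""
    n = len(trajs)
    if n == 0:
        return {"accuracy": 0.0, "faithful_rate": 0.0, "correct_but_unfaithful": 0.0}
    correct = sum(t.answer_correct for t in trajs)
    faithful = sum(is_faithful(t, threshold) for t in trajs)
    lucky = sum(t.answer_correct and not is_faithful(t, threshold) for t in trajs)
    return {
        "accuracy": correct / n,
        "faithful_rate": faithful / n,
        "correct_but_unfaithful": lucky / n,
    }


if __name__ == "__main__":
    target = (10.0, 10.0, 20.0, 20.0)
    demo = [
        Trajectory([(0.0, 0.0, 100.0, 100.0)], target, True),      # crop contains evidence
        Trajectory([(200.0, 200.0, 300.0, 300.0)], target, True),  # irrelevant crop, right answer
        Trajectory([], target, True),                              # no tool call, right answer
    ]
    print(summarize(demo))
```

In the demo, all three answers are correct but only one trajectory cropped the evidence region, which is the kind of mismatch the abstract describes. A real evaluation could replace the geometric check with any other containment test, such as a judge model that inspects the crop.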