CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization
November 24, 2025
Authors: Xinhai Hou, Shaoyuan Xu, Manan Biyani, Mayan Li, Jia Liu, Todd C. Hollon, Bryan Wang
cs.AI
Abstract
Agentic vision-language models are increasingly trained to "think with images" by calling image operations. However, we show that high final-answer accuracy often hides unfaithful visual reasoning: models may invoke tools on irrelevant regions or ignore tool outputs entirely, yet still guess the correct answer. In this work, we first propose a faithfulness evaluation protocol that measures whether intermediate visual tool outputs (e.g., crops) actually contain the queried evidence. This reveals that recent visual agents achieve high final-answer accuracy but exhibit low rates of faithful tool-use on visual search benchmarks. We then introduce CodeV, a code-based visual agent trained with Tool-Aware Policy Optimization (TAPO). TAPO is a process-level RL framework that augments GRPO with dense rewards defined directly on visual tool inputs and outputs, rather than on chain-of-thought tokens, making supervision easier to verify and less susceptible to reward hacking. CodeV represents visual tools as executable Python code, and TAPO assigns step-wise rewards based solely on the question and tool output, encouraging both necessary and evidence-consistent tool use. In a two-stage SFT+RL pipeline, CodeV achieves competitive or superior accuracy while substantially increasing faithful tool-use rates on related visual search benchmarks. Beyond visual search, CodeV attains strong performance on a range of multimodal reasoning and math benchmarks, suggesting that explicitly supervising intermediate tool behavior is crucial for building trustworthy, agentic visual reasoning systems.
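To make the two mechanisms in the abstract concrete, the sketch below illustrates (1) a visual tool expressed as executable Python code (an image crop) and (2) a TAPO-style dense, step-wise reward computed only from the question and the tool output, blended into a GRPO-style group-normalized advantage. This is a minimal illustration under stated assumptions, not the authors' implementation: the names crop_tool, step_reward, tapo_advantages, contains_evidence, and the mixing weight beta are all hypothetical, and the paper's actual verifier and reward shaping may differ.

```python
# Minimal sketch (hypothetical names, not the paper's API):
# (1) a visual tool emitted as executable Python code (an image crop), and
# (2) TAPO-style step rewards on tool outputs combined with a GRPO-style
#     group-normalized outcome advantage.

from dataclasses import dataclass
from statistics import mean, pstdev

from PIL import Image


def crop_tool(image: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """The kind of image operation a code-based agent would emit as executable code."""
    return image.crop(box)


def step_reward(question: str, tool_output: Image.Image, contains_evidence) -> float:
    """Dense per-step reward: does the tool output contain the queried evidence?

    `contains_evidence(question, crop)` stands in for a verifier (e.g. a judge
    model or a ground-truth region check); chain-of-thought text is never scored.
    """
    return 1.0 if contains_evidence(question, tool_output) else 0.0


@dataclass
class Trajectory:
    outcome_reward: float        # 1.0 if the final answer is correct, else 0.0
    step_rewards: list[float]    # one step-wise tool reward per tool call


def tapo_advantages(group: list[Trajectory], beta: float = 0.5) -> list[float]:
    """GRPO-style advantages for one rollout group, augmented with dense tool rewards.

    Each trajectory's return mixes its outcome reward with its mean step reward,
    then is normalized against the other rollouts in the same group.
    """
    returns = [
        t.outcome_reward + beta * (mean(t.step_rewards) if t.step_rewards else 0.0)
        for t in group
    ]
    mu, sigma = mean(returns), pstdev(returns)
    return [(r - mu) / (sigma + 1e-8) for r in returns]


if __name__ == "__main__":
    # Toy rollout group: a faithful trajectory and an unfaithful one that still
    # guessed the right answer. With outcome-only rewards their advantages would
    # be identical; the dense step rewards separate them.
    group = [
        Trajectory(outcome_reward=1.0, step_rewards=[1.0, 1.0]),  # faithful tool use
        Trajectory(outcome_reward=1.0, step_rewards=[0.0]),       # lucky guess
        Trajectory(outcome_reward=0.0, step_rewards=[0.0]),       # wrong answer
    ]
    print(tapo_advantages(group))
```

In this toy group, the correct-but-unfaithful rollout receives a smaller advantage than the faithful one, which is the behavior the abstract attributes to supervising tool inputs and outputs directly rather than chain-of-thought tokens.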