CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization

Abstract

Agentic vision-language models are increasingly trained to "think with images" by calling image operations. However, we show that high final-answer accuracy often hides unfaithful visual reasoning: models may invoke tools on irrelevant regions or ignore tool outputs entirely, yet still guess the correct answer. In this work, we first propose a faithfulness evaluation protocol that measures whether intermediate visual tool outputs (e.g., crops) actually contain the queried evidence. This reveals that recent visual agents achieve high final-answer accuracy but exhibit low rates of faithful tool use on visual search benchmarks. We then introduce CodeV, a code-based visual agent trained with Tool-Aware Policy Optimization (TAPO). TAPO is a process-level RL framework that augments GRPO with dense rewards defined directly on visual tool inputs and outputs, rather than on chain-of-thought tokens, making supervision easier to verify and less susceptible to reward hacking. CodeV represents visual tools as executable Python code, and TAPO assigns step-wise rewards based solely on the question and tool output, encouraging both necessary and evidence-consistent tool use. In a two-stage SFT+RL pipeline, CodeV achieves competitive or superior accuracy while substantially increasing faithful tool-use rates on related visual search benchmarks. Beyond visual search, CodeV attains strong performance on a range of multimodal reasoning and math benchmarks, suggesting that explicitly supervising intermediate tool behavior is crucial for building trustworthy agentic visual reasoning systems.
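The two ideas in the abstract, a faithful tool-use rate and tool-level rewards added to GRPO, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes, purely for illustration, that the queried evidence is annotated as a bounding box, that a trajectory counts as faithful when at least one crop covers a chosen fraction of that box, and that the tool reward is added to the outcome reward before group normalization in the style of GRPO. The names `coverage`, `is_faithful`, `faithful_rate`, and `grpo_style_advantages`, the threshold `0.5`, and the weight `tool_weight` are all hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical box format: (x0, y0, x1, y1) in pixel coordinates.
Box = tuple[float, float, float, float]


def coverage(crop: Box, evidence: Box) -> float:
    """Fraction of the evidence box's area that lies inside the crop."""
    ix0, iy0 = max(crop[0], evidence[0]), max(crop[1], evidence[1])
    ix1, iy1 = min(crop[2], evidence[2]), min(crop[3], evidence[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = (evidence[2] - evidence[0]) * (evidence[3] - evidence[1])
    return inter / area if area > 0 else 0.0


def is_faithful(crops: list[Box], evidence: Box, threshold: float = 0.5) -> bool:
    """Assumed criterion: some crop covers at least `threshold` of the evidence."""
    return any(coverage(c, evidence) >= threshold for c in crops)


def faithful_rate(trajectories: list[tuple[list[Box], Box]],
                  threshold: float = 0.5) -> float:
    """Share of trajectories whose tool outputs contain the queried evidence."""
    if not trajectories:
        return 0.0
    return mean(1.0 if is_faithful(crops, ev, threshold) else 0.0
                for crops, ev in trajectories)


def grpo_style_advantages(outcome_rewards: list[float],
                          tool_rewards: list[float],
                          tool_weight: float = 0.5,
                          eps: float = 1e-6) -> list[float]:
    """Group-normalized advantages over outcome reward plus weighted tool reward.

    The additive combination and `tool_weight` are assumptions of this sketch.
    """
    totals = [r + tool_weight * t for r, t in zip(outcome_rewards, tool_rewards)]
    mu, sigma = mean(totals), pstdev(totals)
    return [(x - mu) / (sigma + eps) for x in totals]
```

In this sketch, a rollout that guesses the right answer from an irrelevant crop earns a lower total reward than one that answers correctly from a crop containing the evidence, so its advantage within the group is smaller. That is the behavior the abstract describes as encouraging evidence-consistent tool use.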
