Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology

Abstract

Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, much as humans "think with images". However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box evaluation, and (3) second-order reasoning that tests object interactions and spatial hierarchies beyond simple object localization. Prioritizing images with dense objects, we initially sample 1K high-quality images from SA-1B and enlist eight LMM experts to manually annotate questions, candidate options, and answers for each image. After three stages of quality control, TreeBench consists of 405 challenging visual question-answering pairs. Even the most advanced models struggle with this benchmark: none of them reaches 60% accuracy, e.g., OpenAI-o3 scores only 54.87. Furthermore, we introduce TreeVGR (Traceable Evidence Enhanced Visual Grounded Reasoning), a training paradigm that supervises localization and reasoning jointly with reinforcement learning, enabling accurate localization and explainable reasoning pathways. Initialized from Qwen2.5-VL-7B, TreeVGR improves on V* Bench (+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4), demonstrating that traceability is key to advancing vision-grounded reasoning. The code is available at https://github.com/Haochen-Wang409/TreeVGR.
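The abstract does not say which metric scores the "traceable evidence" bounding boxes. As a minimal sketch, assuming intersection-over-union (IoU) between predicted and annotated boxes, one plausible scoring rule looks like the following. The function names `box_iou` and `mean_best_iou`, the `(x1, y1, x2, y2)` box format, and the averaging scheme are all illustrative assumptions, not the benchmark's actual protocol.

```python
from typing import Sequence

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Sequence[float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_best_iou(pred_boxes: Sequence[Box], gt_boxes: Sequence[Box]) -> float:
    """Hypothetical evidence score: for each annotated box, take the best
    IoU achieved by any predicted box, then average over annotated boxes."""
    if not gt_boxes:
        return 0.0
    best = [
        max((box_iou(gt, pred) for pred in pred_boxes), default=0.0)
        for gt in gt_boxes
    ]
    return sum(best) / len(best)
```

Under this assumed rule, a model that localizes one of two annotated targets exactly and misses the other would score 0.5, so the score reflects whether the cited regions cover the evidence and not only whether the final answer is correct.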
