Thinking with Visual Grounding

Abstract

Visual thinking should not only sound right; it should show its evidence. While recent vision-language models (VLMs) can produce natural-language reasoning traces, these traces often leave the supporting image regions implicit, making them hard to verify and difficult to supervise. We introduce visually grounded thinking, a reasoning process in which models interleave natural-language thoughts with explicit point or box groundings of the visual evidence used at each step. This lets the model express intermediate reasoning in language while grounding key objects in the image regions they refer to. To train this behavior, we construct a scalable synthesis pipeline that distills correct visual reasoning traces, extracts the visual objects required by the traces, grounds them with a SAM3-based agent, and derives aligned point and box supervision from the resulting masks. We further propose grounding-aware reinforcement learning, which combines answer correctness rewards with dense grounding rewards that score whether generated object references match the correct image evidence. Across two counting benchmarks and four spatial reasoning benchmarks, adding visually grounded thinking to Gemma3-4B-IT consistently improves performance over the original model and the non-grounded thinking baseline. On spatial reasoning, the visually grounded thinking 4B models match, and in some cases surpass, Gemma3-27B-IT from the same model family. Our analysis shows that point grounding is well suited to counting, while box grounding benefits most from explicit grounding rewards on spatial tasks. Overall, our results show that VLMs think better when their intermediate thoughts are tied to the image regions that make them true.
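The abstract mentions two mechanisms that a small example may make more concrete: deriving aligned point and box supervision from a segmentation mask, and combining an answer reward with a dense grounding reward. The following is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not specify the reward formula, so the use of box IoU, the positional pairing of predicted and reference boxes, the `grounding_weight` parameter, and all function names here are hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]            # binary mask, mask[y][x] in {0, 1}
Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max), inclusive pixel indices
Point = Tuple[int, int]           # (x, y)


def _foreground(mask: Mask) -> List[Point]:
    """Collect (x, y) coordinates of all foreground pixels."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        raise ValueError("mask has no foreground pixels")
    return coords


def mask_to_box(mask: Mask) -> Box:
    """Tightest axis-aligned box around the foreground pixels."""
    coords = _foreground(mask)
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


def mask_to_point(mask: Mask) -> Point:
    """Foreground pixel closest to the centroid, so the point lies on the object."""
    coords = _foreground(mask)
    cx = sum(x for x, _ in coords) / len(coords)
    cy = sum(y for _, y in coords) / len(coords)
    return min(coords, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union for inclusive pixel boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0]) + 1
    ih = min(a[3], b[3]) - max(a[1], b[1]) + 1
    inter = max(iw, 0) * max(ih, 0)
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)


def total_reward(
    answer_correct: bool,
    pred_boxes: Sequence[Box],
    ref_boxes: Sequence[Box],
    grounding_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """Answer reward plus a weighted mean IoU over positionally paired boxes."""
    answer_reward = 1.0 if answer_correct else 0.0
    if not ref_boxes:
        return answer_reward
    # Assumption: boxes are paired by position; unmatched boxes contribute 0.
    ious = [box_iou(p, r) for p, r in zip(pred_boxes, ref_boxes)]
    grounding_reward = sum(ious) / max(len(pred_boxes), len(ref_boxes))
    return answer_reward + grounding_weight * grounding_reward
```

Because the point and the box are both computed from the same mask, they refer to the same object by construction, which is one plausible reading of "aligned" supervision. A point-based variant of the grounding reward could instead check whether a predicted point falls inside the reference mask.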