Visually Grounded Thinking

Abstract

Visual thinking should not only sound right; it should show its evidence. While recent vision-language models (VLMs) can produce natural-language reasoning traces, these traces often leave the supporting image regions implicit, making them hard to verify and difficult to supervise. We introduce visually grounded thinking, a reasoning process in which models interleave natural-language thoughts with explicit point or box groundings of the visual evidence used at each step. This lets the model express intermediate reasoning in language while grounding key objects in the image regions they refer to. To train this behavior, we construct a scalable synthesis pipeline that distills correct visual reasoning traces, extracts the visual objects required by the traces, grounds them with a SAM3-based agent, and derives aligned point and box supervision from the resulting masks. We further propose grounding-aware reinforcement learning, which combines answer correctness rewards with dense grounding rewards that score whether generated object references match the correct image evidence. Across two counting benchmarks and four spatial reasoning benchmarks, adding visually grounded thinking to Gemma3-4B-IT consistently improves performance over the original model and the non-grounded thinking baseline. On spatial reasoning, the visually grounded thinking 4B models match, and in some cases surpass, Gemma3-27B-IT from the same model family. Our analysis shows that point grounding is well suited to counting, while box grounding benefits most from explicit grounding rewards on spatial tasks. Overall, our results show that VLMs think better when their intermediate thoughts are tied to the image regions that make them true.
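
To make two steps of the abstract concrete, the following is a minimal sketch of how point and box supervision could be derived from a segmentation mask, and how an answer correctness reward could be combined with a dense grounding reward. The abstract does not specify any of these details, so everything here is an assumption made for illustration: the function names (mask_to_box_and_point, box_iou, point_hits_mask, combined_reward), the choice of the mask pixel nearest the centroid as the point, IoU and point-in-mask as grounding scores, and the additive form with a weight of 0.5.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]            # binary mask, mask[y][x] in {0, 1}
Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max), inclusive pixels
Point = Tuple[int, int]           # (x, y)


def mask_to_box_and_point(mask: Mask) -> Tuple[Box, Point]:
    """Derive an aligned box and point from one object mask (assumed recipe)."""
    pixels = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not pixels:
        raise ValueError("empty mask: no object pixels to ground")
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    box = (min(xs), min(ys), max(xs), max(ys))
    # Snap the centroid to the nearest mask pixel so the point lies on the object.
    cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
    point = min(pixels, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
    return box, point


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union for inclusive pixel boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]) + 1)
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]) + 1)
    inter = iw * ih
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)


def point_hits_mask(point: Point, mask: Mask) -> float:
    """Return 1.0 if the predicted point falls on the reference mask, else 0.0."""
    x, y = point
    inside = 0 <= y < len(mask) and 0 <= x < len(mask[y])
    return 1.0 if inside and mask[y][x] else 0.0


def combined_reward(answer_correct: bool,
                    grounding_scores: Sequence[float],
                    weight: float = 0.5) -> float:
    """Answer reward plus a weighted mean of per-reference grounding scores.

    The additive form and the default weight are assumptions, not the paper's.
    """
    answer = 1.0 if answer_correct else 0.0
    dense = sum(grounding_scores) / len(grounding_scores) if grounding_scores else 0.0
    return answer + weight * dense
```

In this sketch, each object reference in a generated trace would receive one grounding score (box_iou for box groundings, point_hits_mask for point groundings), and combined_reward averages those scores so that traces with many references are not rewarded merely for length.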