How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning

Abstract

Cross-view spatial reasoning remains a weak spot for vision-language models (VLMs): they often reason in language and lose the fine-grained geometry needed for the task. Thinking with images aims to address this by generating an intermediate thinking image, but recent work shows that models often ignore the visual evidence in these traces. We therefore ask how to make visual thinking matter, and what kind of visual thinking works best. We study these questions in unified multimodal models (UMMs), which natively support interleaved image-text generation. For the first question, we propose View Dropout (VDrop), a training-time intervention that hides parts of one input view from the answer span while keeping them visible to the thinking-image tokens. This encourages the model to use the thinking image when answering, instead of relying only on the input views. Once the thinking image is used for answer prediction, we study which type of visual thinking is most effective. We frame this as a learnability-informativeness tradeoff and compare three thinking-image variants: top-down, panoramic, and point-matching renderings. Trained on synthetic scenes and evaluated on five real-world out-of-domain benchmarks, panoramic visual thinking with VDrop is the only configuration that is both informative and learnable, and it achieves the best out-of-domain generalization.
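
One way to picture the VDrop intervention described above is as an attention mask: tokens in the answer span are blocked from attending to a hidden subset of one input view, while the thinking-image tokens still see that view in full. The sketch below is only an illustration under stated assumptions and is not the authors' implementation. The abstract does not specify the mechanism, so the causal attention baseline, the random token-level choice of hidden positions, the segment labels (`view0`, `view1`, `think`, `answer`), the `drop_ratio` parameter, and the function name `vdrop_attention_mask` are all hypothetical.

```python
import random


def vdrop_attention_mask(segment_ids, drop_view, drop_ratio, seed=0):
    """Build a boolean mask where allowed[q][k] means query q may attend to key k.

    segment_ids: one label per token, e.g. "view0", "view1", "think", "answer"
                 (hypothetical labels chosen for this sketch).
    drop_view:   label of the single input view that is partially hidden.
    drop_ratio:  fraction of that view's tokens hidden from the answer span
                 (an assumed hyperparameter, not taken from the paper).
    """
    n = len(segment_ids)
    rng = random.Random(seed)

    # Pick which tokens of the chosen view are hidden from the answer span.
    view_positions = [i for i, s in enumerate(segment_ids) if s == drop_view]
    n_hidden = int(drop_ratio * len(view_positions))
    hidden = set(rng.sample(view_positions, n_hidden))

    # Assumed baseline: plain causal attention over the whole sequence.
    allowed = [[k <= q for k in range(n)] for q in range(n)]

    # Only answer-span queries lose access to the hidden view tokens;
    # thinking-image queries keep the full view.
    for q, seg in enumerate(segment_ids):
        if seg == "answer":
            for k in hidden:
                allowed[q][k] = False
    return allowed, hidden


if __name__ == "__main__":
    segments = ["view0"] * 4 + ["view1"] * 4 + ["think"] * 3 + ["answer"] * 2
    mask, hidden = vdrop_attention_mask(segments, "view1", drop_ratio=0.5)
    print("hidden view1 positions:", sorted(hidden))
    for q, row in enumerate(mask):
        print(f"{segments[q]:>6}", "".join("x" if a else "." for a in row))
```

Under this mask, the only route from the hidden view content to the answer tokens runs through the thinking image, which is the pressure the abstract describes: the model must use the thinking image when answering instead of relying only on the input views.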