How to Imagine and What to Imagine? A Study of Visual Thinking for Cross-View Spatial Reasoning in Unified Multimodal Models

Abstract

Cross-view spatial reasoning remains a weak spot for vision-language models (VLMs): they tend to reason in language and lose the fine-grained geometry the task requires. Thinking with images aims to address this by generating an intermediate thinking image, but recent work shows that models often ignore the visual evidence in these traces. We therefore ask two questions: how can visual thinking be made to matter, and what kind of visual thinking works best? We study both in unified multimodal models (UMMs), which natively support interleaved image-text generation. For the first question, we propose View Dropout (VDrop), a training-time intervention that hides parts of one input view from the answer span while keeping them visible to the thinking-image tokens. This encourages the model to use the thinking image when answering, instead of relying only on the input views. Once the thinking image is used for answer prediction, we study which type of visual thinking is most effective. We frame this as a learnability-informativeness tradeoff and compare three thinking-image variants: top-down, panoramic, and point-matching renderings. Trained on synthetic scenes and evaluated on five real-world out-of-domain benchmarks, panoramic visual thinking with VDrop is the only configuration that is both informative and learnable, and it achieves the best out-of-domain generalization.
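The abstract does not say how View Dropout is implemented. One plausible reading is an attention mask applied during training, and the sketch below illustrates only that reading. The token layout, the `drop_ratio` parameter, the causal baseline, and the function names `choose_dropped_tokens` and `build_vdrop_mask` are all assumptions made for illustration; none of them are taken from the paper.

```python
import random
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # half-open [start, end) range of token positions


def choose_dropped_tokens(view_span: Span, drop_ratio: float,
                          rng: random.Random) -> List[int]:
    """Pick a random subset of token positions inside ONE input view to hide.

    drop_ratio is a hypothetical hyperparameter, not a value from the paper.
    """
    start, end = view_span
    positions = list(range(start, end))
    n_drop = int(round(drop_ratio * len(positions)))
    return sorted(rng.sample(positions, n_drop))


def build_vdrop_mask(seq_len: int, answer_span: Span,
                     dropped: Sequence[int]) -> List[List[bool]]:
    """Return mask[q][k], True when query position q may attend to key k.

    Assumed baseline: plain causal attention (k <= q). On top of that,
    answer-span queries are blocked from the dropped view tokens, while
    every other query (including thinking-image tokens) still sees them.
    """
    dropped_set = set(dropped)
    a_start, a_end = answer_span
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            allowed = k <= q  # causal baseline (assumption)
            if a_start <= q < a_end and k in dropped_set:
                allowed = False  # hide dropped view tokens from the answer
            row.append(allowed)
        mask.append(row)
    return mask


# Hypothetical layout: view A = [0, 4), view B = [4, 8),
# thinking image = [8, 12), answer = [12, 14).
rng = random.Random(0)
dropped = choose_dropped_tokens((4, 8), drop_ratio=0.5, rng=rng)
mask = build_vdrop_mask(seq_len=14, answer_span=(12, 14), dropped=dropped)
```

Under this layout, answer tokens can still attend to the thinking-image tokens and to the view tokens that were not dropped, so the hidden geometry reaches the answer only through the thinking image. In a real model a boolean mask like this would typically be passed to the attention layers during training only and removed at inference.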