What to Imagine, and How? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning

Abstract

Cross-view spatial reasoning remains a weak spot for vision-language models (VLMs): they often reason in language and lose the fine-grained geometry needed for the task. Thinking with images aims to address this by generating an intermediate thinking image, but recent work shows that models often ignore the visual evidence in these traces. We therefore ask how to make visual thinking matter, and what kind of visual thinking works best. We study these questions in unified multimodal models (UMMs), which natively support interleaved image-text generation. For the first question, we propose View Dropout (VDrop), a training-time intervention that hides parts of one input view from the answer span while keeping them visible to the thinking-image tokens. This encourages the model to use the thinking image when answering, instead of relying only on the input views. Once the thinking image is used for answer prediction, we study which type of visual thinking is most effective. We frame this as a learnability-informativeness tradeoff and compare three thinking-image variants: top-down, panoramic, and point-matching renderings. Trained on synthetic scenes and evaluated on five real-world out-of-domain benchmarks, panoramic visual thinking with VDrop is the only configuration that is both informative and learnable, and it achieves the best out-of-domain generalization.
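
One way to picture the VDrop intervention described above is as an attention mask. The sketch below is illustrative only and rests on several assumptions that the abstract does not confirm: that hiding is implemented through the attention mask, that attention is otherwise causal, that dropping is done per token at random, and that the sequence is laid out as view A, view B, question, thinking image, answer. The function name `build_vdrop_mask`, the segment names, and the `drop_ratio` parameter are all hypothetical.

```python
import random


def build_vdrop_mask(segments, drop_view="view_b", drop_ratio=0.5, seed=0):
    """Build a boolean attention mask with a VDrop-style restriction.

    segments: ordered list of (name, length) pairs describing the token layout.
    Returns (mask, dropped, spans), where mask[q][k] is True when query
    position q may attend to key position k.

    Assumption (not from the paper): hiding is done by masking attention.
    """
    # Map each segment name to its range of token positions.
    spans = {}
    start = 0
    for name, length in segments:
        spans[name] = range(start, start + length)
        start += length
    n = start

    # Baseline: standard causal mask (each token sees itself and the past).
    mask = [[k <= q for k in range(n)] for q in range(n)]

    # Pick a random subset of tokens from one input view to hide.
    rng = random.Random(seed)
    view_positions = list(spans[drop_view])
    n_drop = int(drop_ratio * len(view_positions))
    dropped = sorted(rng.sample(view_positions, n_drop))

    # Hide the chosen tokens from the answer span only. Thinking-image
    # queries are left untouched, so they still see the full view.
    for q in spans["answer"]:
        for k in dropped:
            mask[q][k] = False

    return mask, dropped, spans


# Hypothetical layout: two input views, a question, a thinking image, an answer.
layout = [
    ("view_a", 4),
    ("view_b", 4),
    ("question", 2),
    ("thinking_image", 4),
    ("answer", 2),
]
mask, dropped, spans = build_vdrop_mask(layout, drop_view="view_b", drop_ratio=0.5)
```

Under this layout, the answer tokens can no longer read the dropped part of view B directly, while the thinking-image tokens can. Any information from that region that the answer needs therefore has to pass through the thinking image, which is the dependence the abstract says VDrop is meant to encourage.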