Token Warping Helps MLLMs Look from Nearby Viewpoints

Abstract

Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? Although MLLMs perform well on visual reasoning, they remain fragile under viewpoint changes, and pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs can serve as an effective substrate for viewpoint changes. We compare forward and backward warping and find that backward token warping, which defines a dense grid on the target view and retrieves the corresponding source-view token for each grid point, is more stable and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark show that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines, including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.
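To make the backward-warping idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a pinhole camera shared by both views, a known depth for every target-view grid point, nearest-neighbour token lookup, and one grid point per patch; none of these details are stated in the abstract. The function name `backward_token_warp`, its parameters, and the default patch size of 14 pixels are hypothetical.

```python
import math


def backward_token_warp(src_tokens, tgt_depth, fx, fy, cx, cy, R, t, patch=14):
    """Nearest-neighbour backward warp of a token grid (illustrative only).

    src_tokens: H x W nested list of source-view tokens on the patch grid.
    tgt_depth:  H x W nested list of assumed depths at target grid points.
    fx, fy, cx, cy: assumed pinhole intrinsics in pixels (same for both views).
    R, t: 3x3 rotation and 3-vector translation mapping target-camera
          coordinates to source-camera coordinates.
    Returns an H x W grid; cells with no visible source token stay None.
    """
    H, W = len(src_tokens), len(src_tokens[0])
    warped = [[None] * W for _ in range(H)]
    for r in range(H):
        for c in range(W):
            # Pixel centre of this target-view patch.
            u, v = (c + 0.5) * patch, (r + 0.5) * patch
            # Back-project to a 3D point in the target camera frame.
            z = tgt_depth[r][c]
            p = ((u - cx) / fx * z, (v - cy) / fy * z, z)
            # Move the point into the source camera frame: q = R p + t.
            q = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
            if q[2] <= 1e-6:
                continue  # point lies behind the source camera
            # Project into the source image, then convert pixels to patch indices.
            sc = math.floor((fx * q[0] / q[2] + cx) / patch)
            sr = math.floor((fy * q[1] / q[2] + cy) / patch)
            if 0 <= sr < H and 0 <= sc < W:
                warped[r][c] = src_tokens[sr][sc]
    return warped
```

With an identity pose (`R` the identity matrix, `t` all zeros) the function returns the source grid unchanged. Because every target grid point performs its own lookup, the output grid is dense by construction wherever the source view covers the scene; a forward warp would instead scatter source tokens into the target grid and could leave gaps or collisions. How cells without a visible source token are handled is left open here (they stay `None`).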
