

Token Warping Helps MLLMs Look from Nearby Viewpoints

April 3, 2026
作者: Phillip Y. Lee, Chanho Park, Mingue Park, Seungwoo Yoo, Juil Koo, Minhyuk Sung
cs.AI

Abstract

Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping, finding that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.
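The backward warping described above — defining a dense grid on the target view and retrieving a corresponding source-view token for each grid point — can be sketched in plain numpy. This is a minimal illustration, not the authors' implementation: the function name, its signature, and the assumption of shared intrinsics and a per-point target-view depth are all hypothetical, and real ViT token grids would be sampled the same way but at patch rather than pixel resolution.

```python
import numpy as np

def backward_warp_tokens(src_tokens, tgt_depth, K, R, t):
    """Backward-warp a source-view token grid to a nearby target view.

    For every cell on the target grid: unproject using the target-view
    depth, transform into the source camera frame, project, and bilinearly
    sample the source token map. Hypothetical signature for illustration.

    src_tokens: (H, W, C) source-view token grid
    tgt_depth:  (H, W) depth at each target grid point
    K:          (3, 3) shared camera intrinsics (assumed identical per view)
    R, t:       target-to-source rotation (3, 3) and translation (3,)
    """
    H, W, _ = src_tokens.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float64)

    # Unproject target grid points to 3D, then move them into the source frame.
    pts = (pix @ np.linalg.inv(K).T) * tgt_depth[..., None]
    pts_src = pts @ R.T + t

    # Project into the source view to obtain sampling coordinates.
    proj = pts_src @ K.T
    u = proj[..., 0] / proj[..., 2]
    v = proj[..., 1] / proj[..., 2]

    # Bilinear sampling of source tokens, clamped at the grid border.
    u0 = np.clip(np.floor(u).astype(int), 0, W - 2)
    v0 = np.clip(np.floor(v).astype(int), 0, H - 2)
    du = np.clip(u - u0, 0.0, 1.0)[..., None]
    dv = np.clip(v - v0, 0.0, 1.0)[..., None]
    top = src_tokens[v0, u0] * (1 - du) + src_tokens[v0, u0 + 1] * du
    bot = src_tokens[v0 + 1, u0] * (1 - du) + src_tokens[v0 + 1, u0 + 1] * du
    return top * (1 - dv) + bot * dv
```

Because each target cell pulls a token from the source view, every output cell is defined, which avoids the holes and duplicated pixels that forward (scatter-based) warping produces and is one reason the abstract reports backward token warping as more stable.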