
Token Warping Helps MLLMs Look from Nearby Viewpoints

April 3, 2026
Authors: Phillip Y. Lee, Chanho Park, Mingue Park, Seungwoo Yoo, Juil Koo, Minhyuk Sung
cs.AI

Abstract

Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping, finding that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.
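The backward token warping described above can be sketched minimally: define a dense grid over the target view, map each grid point to a position in the source-view token grid (in practice via depth and relative camera pose), and bilinearly interpolate the source tokens there. This is an illustrative sketch, not the paper's implementation; the function name, the precomputed correspondence map `src_coords`, and the use of NumPy bilinear sampling are all assumptions.

```python
import numpy as np

def backward_warp_tokens(src_tokens, src_coords):
    """Hypothetical sketch of backward token warping.

    src_tokens: (H, W, C) source-view token grid (e.g., ViT patch tokens).
    src_coords: (H, W, 2) for each target-view grid point, its (row, col)
                position in the source grid; assumed to come from projecting
                the target point via depth and relative camera pose.
    Returns a (H, W, C) token grid for the target view.
    """
    H, W, _ = src_tokens.shape
    # Clamp coordinates to the valid source grid.
    y = np.clip(src_coords[..., 0], 0, H - 1)
    x = np.clip(src_coords[..., 1], 0, W - 1)
    y0 = np.floor(y).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    x0 = np.floor(x).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    wy = (y - y0)[..., None]; wx = (x - x0)[..., None]
    # Bilinear blend of the four neighboring source tokens per grid point.
    top = src_tokens[y0, x0] * (1 - wx) + src_tokens[y0, x1] * wx
    bot = src_tokens[y1, x0] * (1 - wx) + src_tokens[y1, x1] * wx
    return top * (1 - wy) + bot * wy

# Sanity check with an identity correspondence map: every target grid
# point maps to the same location in the source, so tokens are unchanged.
H, W, C = 4, 4, 8
tokens = np.random.default_rng(0).normal(size=(H, W, C))
yy, xx = np.meshgrid(np.arange(H, dtype=float),
                     np.arange(W, dtype=float), indexing="ij")
coords = np.stack([yy, xx], axis=-1)
warped = backward_warp_tokens(tokens, coords)
assert np.allclose(warped, tokens)
```

Retrieving a token for every target grid point is what makes the backward direction stable: unlike forward warping, no target location is left without a value, so small depth errors cannot punch holes in the warped token grid.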