Token Warping Helps MLLMs Look from Nearby Viewpoints

Abstract

Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? MLLMs perform well on visual reasoning, yet they remain fragile under viewpoint changes, because pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping and find that backward token warping, which defines a dense grid on the target view and retrieves the corresponding source-view token for each grid point, is more stable and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark show that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines, including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.
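The backward token warping described above (a dense grid on the target view, with one source-view token retrieved per grid point) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `backward_token_warp`, the precomputed correspondence map `src_xy`, the nearest-patch lookup, and the patch size of 14 are all assumptions made for illustration. In particular, how the target-to-source correspondences are obtained (for example from depth and relative camera pose) is taken as given here.

```python
import numpy as np


def backward_token_warp(src_tokens, src_xy, patch_size, fill_value=0.0):
    """Gather one source-view token for every target-view grid point.

    src_tokens: (Hs, Ws, D) grid of source-view image tokens.
    src_xy:     (Ht, Wt, 2) finite (x, y) source-image pixel coordinates that
                each target grid point maps to (hypothetical input, assumed
                to be computed elsewhere).
    patch_size: side length in pixels of one ViT patch (assumed square).
    Returns the warped (Ht, Wt, D) token grid and a (Ht, Wt) validity mask.
    """
    hs, ws, dim = src_tokens.shape

    # Index of the source patch that contains each mapped pixel location.
    col = np.floor(src_xy[..., 0] / patch_size).astype(int)
    row = np.floor(src_xy[..., 1] / patch_size).astype(int)

    # Target points that fall outside the source image have no token.
    valid = (col >= 0) & (col < ws) & (row >= 0) & (row < hs)

    warped = np.full(src_xy.shape[:2] + (dim,), fill_value, dtype=src_tokens.dtype)
    warped[valid] = src_tokens[row[valid], col[valid]]
    return warped, valid


# Toy example: a 2x2 source token grid and a target view whose grid points
# map one patch to the right in the source image.
src_tokens = np.arange(4, dtype=float).reshape(2, 2, 1)
src_xy = np.array([[[21.0, 7.0], [35.0, 7.0]],
                   [[21.0, 21.0], [35.0, 21.0]]])
warped, valid = backward_token_warp(src_tokens, src_xy, patch_size=14)
```

Because every target grid point performs its own lookup, the output grid has no holes caused by sparse forward splatting; points that map outside the source image are simply marked invalid and filled with `fill_value`.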
