Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning

Abstract

Existing Multimodal Large Language Models (MLLMs) struggle with 3D spatial reasoning, as they fail to construct structured abstractions of the 3D environment depicted in video inputs. To bridge this gap, drawing inspiration from cognitive theories of allocentric spatial reasoning, we investigate how to enable MLLMs to model and reason over text-based spatial representations of video. Specifically, we introduce Textual Representation of Allocentric Context from Egocentric Video (TRACE), a prompting method that induces MLLMs to generate text-based representations of 3D environments as intermediate reasoning traces for more accurate spatial question answering. TRACE encodes meta-context, camera trajectories, and detailed object entities to support structured spatial reasoning over egocentric videos. Extensive experiments on VSI-Bench and OST-Bench demonstrate that TRACE yields notable and consistent improvements over prior prompting strategies across a diverse range of MLLM backbones, spanning different parameter scales and training schemas. We further present ablation studies to validate our design choices, along with detailed analyses that probe the bottlenecks of 3D spatial reasoning in MLLMs.
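
As a rough illustration of the idea described above, the sketch below builds a prompt that asks a model to write out a textual scene description before answering a spatial question. The abstract does not give the actual prompt, so the wording, the function name build_trace_style_prompt, and the SECTIONS tuple are all assumptions made for illustration; only the three component names (meta-context, camera trajectory, object entities) come from the abstract.

```python
# Hypothetical sketch: not the authors' prompt. Only the three component
# names below are taken from the abstract; everything else is assumed.
SECTIONS = ("meta-context", "camera trajectory", "object entities")


def build_trace_style_prompt(question: str) -> str:
    """Return a prompt that requests a textual scene description first,
    then an answer grounded in that description."""
    # One numbered instruction per component of the intermediate representation.
    steps = "\n".join(
        f"{i}. Describe the {name}." for i, name in enumerate(SECTIONS, start=1)
    )
    # The final step asks for the answer, so the description acts as an
    # intermediate reasoning trace.
    return (
        "Before answering, write a text description of the 3D scene in the video.\n"
        f"{steps}\n"
        f"{len(SECTIONS) + 1}. Using that description, answer: {question}"
    )


if __name__ == "__main__":
    print(build_trace_style_prompt("How far is the sofa from the door?"))
```

The resulting string would be sent to an MLLM together with the video frames; how the frames are attached depends on the model API and is left out here.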
