Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning

Abstract

Existing Multimodal Large Language Models (MLLMs) struggle with 3D spatial reasoning because they fail to construct structured abstractions of the 3D environment depicted in video inputs. To bridge this gap, and drawing inspiration from cognitive theories of allocentric spatial reasoning, we investigate how to enable MLLMs to model and reason over text-based spatial representations of video. Specifically, we introduce Textual Representation of Allocentric Context from Egocentric Video (TRACE), a prompting method that induces MLLMs to generate text-based representations of 3D environments as intermediate reasoning traces for more accurate spatial question answering. TRACE encodes meta-context, camera trajectories, and detailed object entities to support structured spatial reasoning over egocentric videos. Extensive experiments on VSI-Bench and OST-Bench demonstrate that TRACE yields notable and consistent improvements over prior prompting strategies across a diverse range of MLLM backbones, spanning different parameter scales and training schemes. We further present ablation studies to validate our design choices, along with detailed analyses that probe the bottlenecks of 3D spatial reasoning in MLLMs.
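To make the three components named above concrete, the following is a minimal sketch of how a text-based scene representation with meta-context, a camera trajectory, and object entities might be laid out and requested from a model. It is an illustration only, not the format used by TRACE: the class names, the 2D coordinate convention, the field layout, and the prompt wording are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ObjectEntity:
    """One object in the scene (hypothetical schema, not from the paper)."""
    name: str
    position: Tuple[float, float]  # assumed allocentric (x, y) in metres
    note: str = ""


@dataclass
class SceneRepresentation:
    """Text-serialisable scene description (hypothetical schema)."""
    meta_context: str
    camera_trajectory: List[Tuple[float, float]] = field(default_factory=list)
    objects: List[ObjectEntity] = field(default_factory=list)

    def to_text(self) -> str:
        # Serialise the three components in a fixed order so the text
        # can serve as an intermediate reasoning trace.
        lines = [f"Meta-context: {self.meta_context}", "Camera trajectory:"]
        for step, (x, y) in enumerate(self.camera_trajectory):
            lines.append(f"  t{step}: ({x:.1f}, {y:.1f})")
        lines.append("Objects:")
        for obj in self.objects:
            suffix = f" [{obj.note}]" if obj.note else ""
            x, y = obj.position
            lines.append(f"  - {obj.name} at ({x:.1f}, {y:.1f}){suffix}")
        return "\n".join(lines)


def build_prompt(question: str) -> str:
    """Assumed instruction wording: describe the scene first, then answer."""
    return (
        "First describe the scene in the video as text, listing the "
        "meta-context, the camera trajectory, and each object entity in a "
        "shared room-level frame. Then answer the question using that "
        "description.\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    scene = SceneRepresentation(
        meta_context="living room",
        camera_trajectory=[(0.0, 0.0), (1.0, 0.5)],
        objects=[ObjectEntity("sofa", (2.0, 1.0), "against wall")],
    )
    print(scene.to_text())
    print(build_prompt("How far is the sofa from the starting position?"))
```

In this sketch the model would be asked to produce text shaped like the output of `to_text()` before answering, so the spatial layout is stated explicitly rather than left implicit in the video frames.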
