SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding
November 6, 2025
Authors: Ellis Brown, Arijit Ray, Ranjay Krishna, Ross Girshick, Rob Fergus, Saining Xie
cs.AI
Abstract
Despite impressive high-level video comprehension, multimodal language models struggle with spatial reasoning across time and space. Current spatial training approaches rely on real-world video data, yet obtaining diverse footage with precise spatial annotations remains a bottleneck. To alleviate this, we present SIMS-V -- a systematic data-generation framework that leverages the privileged information of 3D simulators to create spatially rich video training data for multimodal language models. Using this framework, we investigate which properties of simulated data drive effective real-world transfer through systematic ablations of question types, mixes, and scales. We identify a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) that prove most effective for developing transferable spatial intelligence, outperforming comprehensive coverage despite using fewer question types. These insights enable highly efficient training: our 7B-parameter video LLM, fine-tuned on just 25K simulated examples, outperforms the larger 72B baseline and achieves performance competitive with proprietary models on rigorous real-world spatial reasoning benchmarks. Our approach generalizes robustly, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.
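
The abstract does not include the generation code, but the core recipe it describes -- reading privileged simulator state and templating it into the three question categories -- is simple to sketch. Below is a minimal, hypothetical Python illustration; every class, field name, question template, and the left/right camera convention are assumptions made for exposition, not the authors' implementation.

```python
# Hypothetical sketch of simulator-grounded spatial QA generation in the spirit
# of SIMS-V. All names, fields, and templates are illustrative assumptions.
import json
import math
import random
from dataclasses import dataclass


@dataclass
class ObjectTrack:
    """Privileged per-frame ground truth a 3D simulator can export for one object."""
    name: str
    positions: list  # (x, y, z) world coordinates, one per sampled frame


def distance(a, b):
    return math.dist(a, b)


def metric_measurement_qa(objs):
    """Metric measurement: absolute distance between two objects, read off ground truth."""
    a, b = random.sample(objs, 2)
    d = distance(a.positions[-1], b.positions[-1])
    return {
        "category": "metric_measurement",
        "question": f"Approximately how far apart are the {a.name} and the {b.name} (in meters)?",
        "answer": f"{d:.1f} m",
    }


def perspective_qa(objs, camera_pos, camera_yaw_deg):
    """Perspective-dependent reasoning: left/right relative to the camera heading.

    Assumed convention: camera looks along +z at yaw 0, with +x to its right.
    """
    obj = random.choice(objs)
    dx = obj.positions[-1][0] - camera_pos[0]
    dz = obj.positions[-1][2] - camera_pos[2]
    # Signed angle between the camera heading and the object direction decides the side.
    angle = math.atan2(dx, dz) - math.radians(camera_yaw_deg)
    side = "right" if math.sin(angle) > 0 else "left"
    return {
        "category": "perspective_dependent",
        "question": f"From the camera's viewpoint, is the {obj.name} to the left or the right?",
        "answer": side,
    }


def temporal_tracking_qa(objs):
    """Temporal tracking: which object moved farthest over the course of the clip."""
    mover = max(objs, key=lambda o: distance(o.positions[0], o.positions[-1]))
    return {
        "category": "temporal_tracking",
        "question": "Which object moved the farthest over the course of the video?",
        "answer": mover.name,
    }


if __name__ == "__main__":
    random.seed(0)
    # Toy stand-in for a simulator export: ground-truth positions at two sampled frames.
    scene = [
        ObjectTrack("mug", [(0.0, 0.9, 1.0), (0.0, 0.9, 1.0)]),
        ObjectTrack("chair", [(2.0, 0.0, 3.0), (2.5, 0.0, 3.5)]),
        ObjectTrack("robot", [(1.0, 0.0, 0.0), (4.0, 0.0, 2.0)]),
    ]
    generators = [
        lambda: metric_measurement_qa(scene),
        lambda: perspective_qa(scene, camera_pos=(0.0, 1.5, 0.0), camera_yaw_deg=0.0),
        lambda: temporal_tracking_qa(scene),
    ]
    for gen in generators:
        print(json.dumps(gen()))
```

Because answers are computed directly from simulator state rather than annotated by hand, every question comes with an exact label for free, which is what makes generating tens of thousands of examples across these categories cheap relative to annotating real footage.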