

SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding

November 6, 2025
Authors: Ellis Brown, Arijit Ray, Ranjay Krishna, Ross Girshick, Rob Fergus, Saining Xie
cs.AI

Abstract

Despite impressive high-level video comprehension, multimodal language models struggle with spatial reasoning across time and space. While current spatial training approaches rely on real-world video data, obtaining diverse footage with precise spatial annotations remains a bottleneck. To alleviate this bottleneck, we present SIMS-V -- a systematic data-generation framework that leverages the privileged information of 3D simulators to create spatially-rich video training data for multimodal language models. Using this framework, we investigate which properties of simulated data drive effective real-world transfer through systematic ablations of question types, mixes, and scales. We identify a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) that prove most effective for developing transferable spatial intelligence, outperforming comprehensive coverage despite using fewer question types. These insights enable highly efficient training: our 7B-parameter video LLM fine-tuned on just 25K simulated examples outperforms the larger 72B baseline and achieves competitive performance with proprietary models on rigorous real-world spatial reasoning benchmarks. Our approach demonstrates robust generalization, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.
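The abstract does not spell out the data-generation pipeline in code, but its core idea (reading privileged ground truth out of a 3D simulator and templating it into the three question categories: metric measurement, perspective-dependent reasoning, and temporal tracking) can be sketched. The snippet below is a minimal, hypothetical illustration: the `frames` metadata layout, field names, and question templates are assumptions made for this sketch, not the actual SIMS-V framework.

```python
# Illustrative sketch only: turning privileged simulator metadata into spatial QA pairs.
# All structures and field names below are hypothetical placeholders that mirror the
# three question categories named in the abstract; they are not the SIMS-V pipeline.
import math

# Hypothetical per-frame ground truth a 3D simulator can export for free:
# object name -> (x, y, z) world coordinates, plus the camera pose per frame.
frames = [
    {
        "camera": {"pos": (0.0, 1.5, 0.0), "yaw_deg": 0.0},
        "objects": {"mug": (1.2, 0.9, 2.0), "chair": (-0.8, 0.0, 3.5)},
    },
    {
        "camera": {"pos": (0.5, 1.5, 1.0), "yaw_deg": 30.0},
        "objects": {"mug": (1.2, 0.9, 2.0), "chair": (-0.8, 0.0, 3.5)},
    },
]

def dist(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def metric_question(frame, obj_a, obj_b):
    """Metric measurement: exact distances come straight from ground-truth coordinates."""
    d = dist(frame["objects"][obj_a], frame["objects"][obj_b])
    return (f"How far apart are the {obj_a} and the {obj_b}, in meters?", f"{d:.1f} m")

def perspective_question(frame, obj):
    """Perspective-dependent reasoning: is the object left or right of the camera,
    given the camera's current position and heading?"""
    cx, _, cz = frame["camera"]["pos"]
    ox, _, oz = frame["objects"][obj]
    yaw = math.radians(frame["camera"]["yaw_deg"])
    # Rotate the object's displacement into the camera frame; negative x means "left"
    # under this (assumed) convention.
    rel_x = math.cos(yaw) * (ox - cx) - math.sin(yaw) * (oz - cz)
    side = "left" if rel_x < 0 else "right"
    return (f"From the camera's current viewpoint, is the {obj} to the left or right?", side)

def temporal_question(obj):
    """Temporal tracking: which frame first shows the object, known trivially
    from the simulator's logs."""
    first = next(i for i, f in enumerate(frames) if obj in f["objects"])
    return (f"In which frame does the {obj} first appear?", str(first))

qa_pairs = [
    metric_question(frames[0], "mug", "chair"),
    perspective_question(frames[1], "chair"),
    temporal_question("mug"),
]
for q, a in qa_pairs:
    print(f"Q: {q}\nA: {a}\n")
```

Because the simulator supplies exact poses and trajectories, answers generated this way are labeled with no human annotation, which is the property the abstract credits for making 25K examples sufficient for transfer.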