VideoMolmo: Spatio-Temporal Grounding Meets Pointing
June 5, 2025
Authors: Ghazi Shazan Ahmad, Ahmed Heakl, Hanan Gani, Abdelrahman Shaker, Zhiqiang Shen, Ranjay Krishna, Fahad Shahbaz Khan, Salman Khan
cs.AI
Abstract
Spatio-temporal localization is vital for precise interactions across diverse
domains, from biological research to autonomous navigation and interactive
interfaces. Current video-based approaches, while proficient in tracking, lack
the sophisticated reasoning capabilities of large language models, limiting
their contextual understanding and generalization. We introduce VideoMolmo, a
large multimodal model tailored for fine-grained spatio-temporal pointing
conditioned on textual descriptions. Building upon the Molmo architecture,
VideoMolmo incorporates a temporal module utilizing an attention mechanism to
condition each frame on preceding frames, ensuring temporal consistency.
Additionally, our novel temporal mask fusion pipeline employs SAM2 for
bidirectional point propagation, significantly enhancing coherence across video
sequences. This two-step decomposition, i.e., first using the LLM to generate
precise pointing coordinates, then relying on a sequential mask-fusion module
to produce coherent segmentation, not only simplifies the task for the language
model but also enhances interpretability. Due to the lack of suitable datasets,
we curate a comprehensive dataset comprising 72k video-caption pairs annotated
with 100k object points. To evaluate the generalization of VideoMolmo, we
introduce VPoS-Bench, a challenging out-of-distribution benchmark spanning five
real-world scenarios: Cell Tracking, Egocentric Vision, Autonomous Driving,
Video-GUI Interaction, and Robotics. We also evaluate our model on Referring
Video Object Segmentation (Refer-VOS) and Reasoning VOS tasks. In comparison to
existing models, VideoMolmo substantially improves spatio-temporal pointing
accuracy and reasoning capability. Our code and models are publicly available
at https://github.com/mbzuai-oryx/VideoMolmo.
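
To make the temporal module described in the abstract concrete, below is a minimal PyTorch sketch (not the authors' implementation) of how an attention mechanism might condition the current frame's visual tokens on features from preceding frames; the class name, token dimensions, and residual fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TemporalConditioner(nn.Module):
    """Sketch of a temporal module: cross-attention from the current frame's
    visual tokens to concatenated tokens of preceding frames, followed by a
    residual connection, to encourage temporally consistent pointing."""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, current: torch.Tensor, previous: torch.Tensor) -> torch.Tensor:
        # current:  (B, N, D) tokens of the frame being processed
        # previous: (B, T*N, D) tokens of the T preceding frames
        attended, _ = self.cross_attn(query=current, key=previous, value=previous)
        return self.norm(current + attended)  # residual fusion

# Toy usage: two preceding frames of 196 tokens each condition the current frame.
tokens_now = torch.randn(1, 196, 1024)
tokens_prev = torch.randn(1, 2 * 196, 1024)
fused = TemporalConditioner()(tokens_now, tokens_prev)
print(fused.shape)  # torch.Size([1, 196, 1024])
```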
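The two-step decomposition can likewise be sketched as a small pipeline: a pointing model supplies (x, y) coordinates for the queried object on one frame, and SAM2 propagates that point forward and backward before the per-frame masks are combined. The snippet assumes the public SAM2 video-predictor interface (build_sam2_video_predictor, init_state, add_new_points_or_box, propagate_in_video); predict_point, the config/checkpoint paths, and the union-based fusion are hypothetical placeholders, not VideoMolmo's actual pointing head or mask-fusion module.

```python
import numpy as np
from sam2.build_sam import build_sam2_video_predictor  # assumes the official SAM2 package

def predict_point(frame_idx: int) -> np.ndarray:
    """Hypothetical stand-in for the LLM pointing step: returns an (x, y) point."""
    return np.array([[320.0, 240.0]], dtype=np.float32)

# Placeholder config/checkpoint names and frame directory.
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")
state = predictor.init_state(video_path="frames_dir")  # directory of video frames

anchor_idx = 8  # frame on which the pointing model produced a coordinate
predictor.add_new_points_or_box(
    inference_state=state,
    frame_idx=anchor_idx,
    obj_id=0,
    points=predict_point(anchor_idx),
    labels=np.array([1], dtype=np.int32),  # 1 = positive click
)

masks = {}
for reverse in (False, True):  # forward pass, then backward pass from the anchor
    for f_idx, obj_ids, logits in predictor.propagate_in_video(
        state, start_frame_idx=anchor_idx, reverse=reverse
    ):
        mask = (logits[0] > 0.0).cpu().numpy()
        # Naive fusion: take the union when a frame is reached from both directions.
        masks[f_idx] = mask if f_idx not in masks else np.logical_or(masks[f_idx], mask)
```

In this decomposition the language model only has to emit a point per query, while temporal coherence is delegated to the propagation-and-fusion stage, which is the simplification and interpretability benefit the abstract refers to.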