VideoMolmo: Spatio-Temporal Grounding Meets Pointing
June 5, 2025
Authors: Ghazi Shazan Ahmad, Ahmed Heakl, Hanan Gani, Abdelrahman Shaker, Zhiqiang Shen, Ranjay Krishna, Fahad Shahbaz Khan, Salman Khan
cs.AI
Abstract
Spatio-temporal localization is vital for precise interactions across diverse
domains, from biological research to autonomous navigation and interactive
interfaces. Current video-based approaches, while proficient in tracking, lack
the sophisticated reasoning capabilities of large language models, limiting
their contextual understanding and generalization. We introduce VideoMolmo, a
large multimodal model tailored for fine-grained spatio-temporal pointing
conditioned on textual descriptions. Building upon the Molmo architecture,
VideoMolmo incorporates a temporal module utilizing an attention mechanism to
condition each frame on preceding frames, ensuring temporal consistency.
Additionally, our novel temporal mask fusion pipeline employs SAM2 for
bidirectional point propagation, significantly enhancing coherence across video
sequences. This two-step decomposition, i.e., first using the LLM to generate
precise pointing coordinates, then relying on a sequential mask-fusion module
to produce coherent segmentation, not only simplifies the task for the language
model but also enhances interpretability. Due to the lack of suitable datasets,
we curate a comprehensive dataset comprising 72k video-caption pairs annotated
with 100k object points. To evaluate the generalization of VideoMolmo, we
introduce VPoS-Bench, a challenging out-of-distribution benchmark spanning five
real-world scenarios: Cell Tracking, Egocentric Vision, Autonomous Driving,
Video-GUI Interaction, and Robotics. We also evaluate our model on Referring
Video Object Segmentation (Refer-VOS) and Reasoning VOS tasks. In comparison to
existing models, VideoMolmo substantially improves spatio-temporal pointing
accuracy and reasoning capability. Our code and models are publicly available
at https://github.com/mbzuai-oryx/VideoMolmo.
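
To make the temporal module described in the abstract concrete, below is a minimal PyTorch sketch (not the authors' implementation) of how an attention mechanism might condition the current frame's visual tokens on features from preceding frames; the class name, token dimensions, and residual fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TemporalConditioner(nn.Module):
    """Sketch of a temporal module: cross-attention from the current frame's
    visual tokens to concatenated tokens of preceding frames, followed by a
    residual connection, to encourage temporally consistent pointing."""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, current: torch.Tensor, previous: torch.Tensor) -> torch.Tensor:
        # current:  (B, N, D) tokens of the frame being processed
        # previous: (B, T*N, D) tokens of the T preceding frames
        attended, _ = self.cross_attn(query=current, key=previous, value=previous)
        return self.norm(current + attended)  # residual fusion

# Toy usage: two preceding frames of 196 tokens each condition the current frame.
tokens_now = torch.randn(1, 196, 1024)
tokens_prev = torch.randn(1, 2 * 196, 1024)
fused = TemporalConditioner()(tokens_now, tokens_prev)
print(fused.shape)  # torch.Size([1, 196, 1024])
```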
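The two-step decomposition can likewise be sketched as a small pipeline: a pointing model supplies (x, y) coordinates for the queried object on one frame, and SAM2 propagates that point forward and backward before the per-frame masks are combined. The snippet assumes the public SAM2 video-predictor interface (build_sam2_video_predictor, init_state, add_new_points_or_box, propagate_in_video); predict_point, the config/checkpoint paths, and the union-based fusion are hypothetical placeholders, not VideoMolmo's actual pointing head or mask-fusion module.

```python
import numpy as np
from sam2.build_sam import build_sam2_video_predictor  # assumes the official SAM2 package

def predict_point(frame_idx: int) -> np.ndarray:
    """Hypothetical stand-in for the LLM pointing step: returns an (x, y) point."""
    return np.array([[320.0, 240.0]], dtype=np.float32)

# Placeholder config/checkpoint names and frame directory.
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")
state = predictor.init_state(video_path="frames_dir")  # directory of video frames

anchor_idx = 8  # frame on which the pointing model produced a coordinate
predictor.add_new_points_or_box(
    inference_state=state,
    frame_idx=anchor_idx,
    obj_id=0,
    points=predict_point(anchor_idx),
    labels=np.array([1], dtype=np.int32),  # 1 = positive click
)

masks = {}
for reverse in (False, True):  # forward pass, then backward pass from the anchor
    for f_idx, obj_ids, logits in predictor.propagate_in_video(
        state, start_frame_idx=anchor_idx, reverse=reverse
    ):
        mask = (logits[0] > 0.0).cpu().numpy()
        # Naive fusion: take the union when a frame is reached from both directions.
        masks[f_idx] = mask if f_idx not in masks else np.logical_or(masks[f_idx], mask)
```

In this decomposition the language model only has to emit a point per query, while temporal coherence is delegated to the propagation-and-fusion stage, which is the simplification and interpretability benefit the abstract refers to.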