
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

January 15, 2026
Authors: Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, Ranjay Krishna
cs.AI

Abstract

Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than high-level video understanding; they require grounding, either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding on single-image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and a new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data using an efficient packing and message-tree encoding scheme, and show that bi-directional attention on vision tokens and a novel token-weight strategy improve performance. Our 8B model outperforms other models in the open-weight, open-data class on short videos, counting, and captioning, and is competitive on long videos. On video grounding, Molmo2 significantly outperforms existing open-weight models such as Qwen3-VL (35.5 vs. 29.6 accuracy on video counting) and surpasses proprietary models such as Gemini 3 Pro on some tasks (38.4 vs. 20.0 F1 on video pointing and 56.2 vs. 41.1 J&F on video tracking).
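The abstract credits part of the gain to bi-directional attention among vision tokens inside an otherwise causal decoder. The paper's exact masking scheme is not reproduced on this page; the snippet below is only a minimal sketch of the general idea, assuming a packed token sequence with a boolean per-token vision indicator. The helper name `build_attention_mask` and the choice to let all vision tokens attend to one another are illustrative assumptions, not the authors' implementation.

```python
import torch

def build_attention_mask(is_vision: torch.Tensor) -> torch.Tensor:
    """Attention mask where vision tokens attend bidirectionally.

    is_vision: (seq_len,) bool tensor, True at vision-token positions.
    Returns:   (seq_len, seq_len) bool tensor; entry [i, j] is True if
               token i may attend to token j.
    """
    seq_len = is_vision.shape[0]
    # Text tokens keep the usual causal constraint: i attends to j <= i.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Vision tokens additionally attend to every other vision token,
    # including those appearing later in the sequence.
    vision_block = is_vision.unsqueeze(1) & is_vision.unsqueeze(0)
    # A real packed-sequence implementation would also restrict attention
    # to tokens from the same sample or image chunk; omitted here.
    return causal | vision_block

# Example: two text tokens, four vision tokens, then two text tokens.
is_vision = torch.tensor([0, 0, 1, 1, 1, 1, 0, 0], dtype=torch.bool)
mask = build_attention_mask(is_vision)  # shape (8, 8)
```

This prefix-LM-style mix of bidirectional image attention with causal text attention appears in other open VLMs as well; consult the Molmo2 report for the scheme actually used, along with the packing and message-tree encoding details.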