

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

January 15, 2026
作者: Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, Ranjay Krishna
cs.AI

Abstract

Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on state-of-the-art video (and image) language models. Crucially, many downstream applications require more than high-level video understanding; they require grounding, either by pointing or by tracking in pixels, a capability that even proprietary models lack. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding across single-image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object-tracking dataset with complex queries, and an innovative new video-pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data that uses an efficient packing and message-tree encoding scheme, and show that bi-directional attention on vision tokens and a novel token-weight strategy improve performance. Our flagship 8B model outperforms other open-weight, open-data models on short videos, counting, and captioning, and is competitive on long videos. On video grounding, Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
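The abstract mentions bi-directional attention over vision tokens inside an otherwise autoregressive model. The sketch below is a minimal, hypothetical illustration of one common way to realize this: a combined attention mask that keeps text tokens causal while letting vision tokens attend to each other in both directions. It assumes a single interleaved token sequence with a boolean vision-token indicator; none of the names come from the Molmo2 codebase.

```python
# Hypothetical sketch (not the paper's implementation): an attention mask in which
# vision tokens attend bidirectionally among themselves, text tokens remain causal.
import torch


def build_attention_mask(is_vision: torch.Tensor) -> torch.Tensor:
    """is_vision: bool tensor of shape (seq_len,), True where the token is a vision token.

    Returns a (seq_len, seq_len) bool mask where True means attention is allowed
    from query position i (row) to key position j (column).
    """
    seq_len = is_vision.shape[0]
    # Standard causal (lower-triangular) mask: each token sees itself and earlier tokens.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Additionally allow any vision token to attend to any other vision token,
    # including ones that appear later in the sequence.
    vision_pair = is_vision.unsqueeze(0) & is_vision.unsqueeze(1)
    return mask | vision_pair


# Example: 4 vision tokens (e.g. image patches) followed by 3 text tokens.
is_vision = torch.tensor([True] * 4 + [False] * 3)
mask = build_attention_mask(is_vision)
# mask[0, 3] is True (vision token 0 can see the later vision token 3),
# while mask[4, 5] stays False (text remains strictly causal).
```

This mask would typically be passed to the transformer's attention layers in place of the default causal mask; the exact plumbing depends on the model implementation.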