MolmoPoint: Better Pointing for VLMs with Grounding Tokens
March 30, 2026
Authors: Christopher Clark, Yue Yang, Jae Sung Park, Zixian Ma, Jieyu Zhang, Rohun Tripathi, Mohammadreza Salehi, Sangho Lee, Taira Anderson, Winson Han, Ranjay Krishna
cs.AI
Abstract
Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens that contain the target concept. Our model generates a special pointing token that cross-attends to the input image or video tokens and selects the appropriate one. To make this model more fine-grained, we follow these pointing tokens with an additional special token that selects a fine-grained subpatch within the initially selected region, and then a third token that specifies a location within that subpatch. We further show that performance improves by generating points sequentially in a consistent order, encoding the relative position of the previously selected point, and including a special no-more-points class when selecting visual tokens. Using this method, we set a new state-of-the-art on image pointing (70.7% on PointBench), set a new state-of-the-art among fully open models on GUI pointing (61.1% on ScreenSpotPro), and improve video pointing (59.1% human preference win rate vs. a text coordinate baseline) and tracking (+6.3% gain on Molmo2Track). We additionally show that our method achieves much higher sample efficiency and discuss the qualitative differences that emerge from this design change.
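The mechanism described above can be illustrated with a minimal sketch. This toy example, which assumes a simple dot-product cross-attention score, shows how a pointing token's hidden state could select among visual tokens (with an extra "no-more-points" class) and then refine the choice hierarchically; all function names, vector shapes, and the two-level depth are illustrative assumptions, not the paper's actual implementation (which also adds a third refinement level and positional encoding of the previous point).

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of raw attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def point(query, visual_tokens, stop_embedding):
    """Cross-attend the pointing query over the visual tokens plus a
    special 'no-more-points' class; return the selected token index,
    or -1 if the stop class wins."""
    candidates = visual_tokens + [stop_embedding]
    scores = [sum(q * k for q, k in zip(query, v)) for v in candidates]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return -1 if best == len(visual_tokens) else best

def hierarchical_point(queries, patches, subpatches_per_patch, stop_embedding):
    """Two-level refinement: a coarse patch is selected first, then a
    subpatch within it (the paper adds a third, still finer level the
    same way). Returns None when the model emits no more points."""
    coarse = point(queries[0], patches, stop_embedding)
    if coarse == -1:
        return None
    fine = point(queries[1], subpatches_per_patch[coarse], stop_embedding)
    return (coarse, fine)
```

Because each level only scores the tokens inside the previously selected region, the fine-grained choice stays cheap regardless of image resolution, which is one motivation for the hierarchical design over emitting raw text coordinates.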